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Abstract 


This work presents and analyzes three convolutional neural network (CNN) 
models for efficient pixelwise classification of images. When using convolu- 
fional neural networks fo classify single pixels in pafches of a whole image, 
a lof of redundanf compufafions are carried ouf when using sliding window 
nefworks. This sef of new archifecfures solve fhis issue by eifher removing 
redundanf compufafions or using fully convolutional archifecfures fhaf inher- 
enfly predicf many pixels af once. 

The implemenfafions of fhe fhree models are accessible fhrough a new ufilify 
on fop of fhe Caffe library. The ufilify provides supporf for a wide range 
of image inpuf and oufpuf formafs, pre-processing paramefers and mefhods 
fo equalize fhe label hisfogram during framing. The Caffe library has been 
exfended by new layers and a new backend for availabilify on a wider range 
of hardware such as CPUs and GPUs fhrough OpenCL. 

On AMD GPUs, speedups of 54 x (SK-Nef), 437 x (U-Nef) and 320 x (USK- 
Nef) have been observed, faking fhe SK equivalenf SW (sliding window) nef- 
work as fhe baseline. The label fhroughpuf is up fo one megapixel per second. 

The analyzed neural nefworks have disfincfive characferisfics fhaf apply dur¬ 
ing framing or processing, and nof every dafa sef is suifable fo every archifec- 
fure. The qualify of fhe predicfions is assessed on two neural tissue data sets, 
of which one is fhe ISBI 2012 challenge dafa sef. Two differenf loss functions. 
Mails loss and Soffmax loss, were used during framing. 

The whole pipeline, consisting of models, inferface and modified Caffe library, 
is available as Open Source soffware under fhe working fifle Project Greentea. 
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Chapter 1 


Introduction 


1.1 Convolutional Neural Networks 

Convolutional neural networks are forward-backward neural networks that are 
mostly based on convolutions with machine learnable kernels, pooling operations 
and element-wise non-linear activation functions. The networks can be employed 
for various image classification and object recognition tasks. A prominent example 
is the ImageNet / AlexNet Q for object recognition. Recent networks Q can have 
very many, in this case over 20, layers and millions of learnable parameters. 

This work is focused on classifying biomedical data, in particular neural tissue 
electron microscopy images (see Chapter]^. The challenge with this kind of data 
sets is that training data is more scarce than with data sets that can be generated 
from everyday pictures such as handwritten letters or online collections of images. 
Annotating ground truth for neural tissue images is a lot of manual work, as every 
single pixel has to be labeled. 

Consequently, improving training speeds is not a primary objective to optimize 
for. The data that has to be processed with a trained model afterwards however 
can easily reach terabyte-scale. It is therefore crucial to develop networks that 
are as efficient as possible in the forwarding step. This work presents three such 
efficient pixel classification networks. 

When training, the classical mislabeling objectives such as Softmax or Cross-entropy 
loss might not be the most useful for pixelwise classifications of biological images, 
and using spatial context information to generate an error signal to train the neural 
networks can perform better. Therefore, this research also considers different train¬ 
ing methods including the Malis ||^ criterion. 

To relate the objectives of this technical report with the title, it needs to be dissected 
into its components: 

• Efficient Convolutional Neural Networks means the network models analyzed 
and designed are as efficient as possible in getting the task done - in this 
case, pixelwise classification of electron microscopy neural tissue images. 

• Pixelwise Classification, as opposed to image classification, aims to propose a 
label to each pixel in a given image. It can also be seen as many separate 
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1.2. Caffe Library 


image classifications of small patches in a bigger image. This gives rise to 
new optimization possibilities as the contexts for the predictions overlap 
spatially. 

• Heterogeneous Hardware Systems means the network models used should also 
run as efficient as possible on a variety of compute devices. This objective 
makes efficient neural networks more accessible to users and allows to use 
existing hardware and clusters to get segmentation tasks with neural net¬ 
works done. 


1.2 Caffe Library 

Caffe stands for Convolutional Architecture for Fast Feature Embedding Q. It 
is a state-of-the-art neural network library that has been heavily optimized for 
the use with nVidias CUDA technology. In many cases, the library is therefore 
already very efficient using certain GPUs. What was missing until now Q is fast 
CPU support (the current CPU backend is mostly single threaded) and support 
for GPUs and accelerator devices from AMD and Intel. The library is still under 
development and has a large community Q. 

Network models and trained weights (usually called model zoo) can be shared 
in Google's prototxt (network and learning configurations) and protocol buffer 
(trained weights and solver states) format. 

The library is typically used on the command line with the Gaffe binary or through 
a python (Pycaffe) interface (see Figure |1.3| |. For more advanced and intrusive 
interfaces, G++ interfaces can be programmed on top of the library. 

All models, utilities and backend additions programmed for this project are based 
on and around Gaffe. The changes to the library are documented in Ghapter|^ 


1.3 Pixelwise Classification 


Pixelwise classification means labeling each pixel in an image based on a local 


context around the pixel. Figure 1.1 shows how this works with sliding window 
networks that outputs a single pixel per input tile of size v + w = 101 -|- 1 = 102. 

While minibatch processing can output many pixels at once, this is still inefficient 
(see Sections 3.2 and |6.5|. The work by Hongsheng Li et al. |0 allows to make 


existing SW networks more efficient while giving identical prediction results (see 
Section |3.3.1| |. 

Alternatively, fully convolutional models (U and partially also USK) directly out¬ 
put a bigger patch, as depicted in Figure |1.2[ This method of training and pro¬ 
cessing is called patch-based {n = 1, w > 1), as opposed to minibatch-based 
{n > 1, w = 1). A combination of both (n > 1, w > 1) is possible but only useful 
when the images in the data set can not be tiled with large w 3> 1 (see Section 


6.41. 


Minibatches can still have advantages during training (see Section 4.31 because 
every element in the minibatch can be picked independently. During processing, 
networks that output large patches are always performing better. 
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1.3. Pixelwise Classification 


In all cases, the images have to be extended (padded) by mirroring on the borders 
by I pixels on each side if every pixel of the image is to be labeled. 




(b) Generated minibatch input of size n = 4 
and with a context of 102 by 102 pixels. The 
output classification for this input will be 4 
by 1 pixels. The individual images in the 
minibatch are only shifted by one pixel each. 
The data overlaps and is copied into the 
network redundantly. 


(a) Input image to classify with border 
mirroring to extend the image. The green 
rectangle is a 4 by 1 pixel area to be labeled. 


Figure 1.1: Pixelwise image classification based on sliding window architectures. 
(Raw image source: ssTEM ||^, Q). The surrounding context (blue rectangles) is 
what determines the labeling decision of the neural network. 



(a) The green area to be labeled is 128 by 128 
pixels {w = 128). The green patch with the 
blue context padding {v = 101) is directly 
what the networks take as input. 



(b) The data is passed through the network 
as large a tile of size w -\-v hy w + v instead 
of a minibatch. There is no overlapping data 
being passed through the network 
redundantly and no duplicated convolution 
and pooling operations are carried out. The 
output prediction is a large patch {w = 128) 
instead of a stride of pixels from a 
minibatch (as in Figure [TT|. 


Figure 1.2: Pixelwise image classification based on strided kernel and fully 
convolutional architectures. (Raw image source: ssTEM 0, 0). 
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1.4. Existing Work 


1.4 Existing Work 


This technical report is based on the following existing work: 


SW (sliding window) network designed by Julien Martel |10|, not published. 
The architecture is trimmed for segmenting the data set DSl. 

Strided kernel convolution and pooling kernels by Hongsheng Li et al. 0. 
This is the fundamental approach in speeding up existing SW networks. 


• Malis criterion, first introduced by Srinivas Turaga et al. ||^. The criterion sup¬ 
ports an alternative way of training neural networks through affinity graphs, 
which is very specific and useful on biomedical data, where areas are separ¬ 
ated by background borders. 


• The Open Source Caffe library maintained by the Berkeley Vision and Learn¬ 
ing Center ||^, ||4|. 

• U network designed by Ronneberger et al. Q. This model is also optimized 
for biomedical images and especially the data set DS2 (ISBI 2012 pT|). 


• N-dimensional convolution kernels by Jeff Donahue 


• Segmentation evaluation scripts of the ISBI 2012 challenge 


|11|, |13|. 


1.5 New Contributions 


An overview of new contributions to the Caffe landscape in terms of models, utilit¬ 
ies and library changes is given in Ligure |1.3[ This work tackles the given problem 
on all levels - from using efficient BLAS libraries over backend development and 
frontends for easy use to new network models. 

On the side of neural network models, this project introduces two new neural 
network architectures, the SK-Net (Section and USK-Net (Section [T^ . 

A meta-analysis of the three efficient networks (SK, U, USK) is given based on: 

• Differences and characteristics of the network designs (Chapter]^. 

• Computational cost and efficiency (Chapter]^. 

• Image segmentation quality, assessed both numerically and visually (Chapter 
0 for the typical foreground-background two label classification. 

In order to be able to train the network models easily on various data sets, the 
Caffe Neural Tool (Chapter]^ has been developed. 

The Caffe library (see Chapter]^ has been extended with new layers and adaptions 
for compability with the new backend. The new layers also affect the functional¬ 
ity of the CUDA backend. OpenCL backend development (see Section |5.4| was 
mostly focused on versatility and completeness, so that CPUs and all kinds of 
compute devices can be used on all network models. This includes compability to 
three different BLAS libraries: clBLAS, ViennaCL-BLAS and cBLAS. 
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1.5. New Contributions 
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Figure 1.3: Project Greentea overview. Green boxes denote completely new 
additions to the Gaffe landscape. Red boxes are parts that have been 
re-implemented or adapted from existing work to fit the needs of this project. 
Blue parts are mostly unchanged from existing work. Dashed arrows denote 
deprecated and only partially supported combinations. 


This report is also providing an introduction into segmentation and pixelwise clas¬ 
sification with neural networks. It contains all the details necessary to understand 
existing models and readers should be able to easily design their own neural net¬ 
works based on the findings of this research project. 

The combination of the new contributions and the Gaffe library infrastructure 
is summarized under the working title Project Greentea. However, Greentea also 
stands for the new OpenGL backend architecture which has been optimized for 
high flexibility (see Section [s!^ . 

Greentea was a name of my choice because I simply prefer greentea over coffee 
(Gaffe). Greentea can not be used as an abbreviation like it is the case with Gaffe, 
but as greentea is supposed to be good for the brain, using it for a neural network 
machine learning toolset seems to be appropriate. 
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1.6. Terminology 


1.6 Terminology 

The most used symbols and abbreviations in the report: 

Forwarding, processing: Computing data through a neural network from input 
to output. 

Backwarding, training, backpropagation: Computing neural network gradients 
(diff maps) in the backward direction and updating the network weights. 

Data blob: Memory blob containing feature maps of forward processing in 
the neural network. 

Diff blob: Memory blob containing the differential / error signal map during 
backpropagation. 

BLAS: Basic Linear Algebra Subprograms. Includes functions such as effi¬ 
cient matrix multiplications. 

DSl: Data set 1, see Section [TT] 

DS2: Data set 2, see Section 


SW: Sliding window networks for pixelwise classification, see Section 3.2 


SK: Strided kernel networks for pixelwise classification, see Section 3.3 
U: Ronneberger et al. Q network architecture, see Section 


USK: Network architecture combining SK and U aspects, see Section |3.5[ 

/: Number of feature maps after (/out) and before (/n) a network layer. 

w: Size (in each dimension) of a feature map in a network layer. When not 
indexed or otherwise noted, it refers to the output size of a network layer. 

v: Size of the total network input padding (context). 

p: Network layer padding. 

s: Network layer stride. 

k: Network layer kernel size. 

d: Network layer kernel stride. 

L: Set of layers with layers I ^ L. 

B: Set of memory blobs with blobs b ^ B. 

W: Set of network weights, |W| denotes the number of weights. 

M: Device or host memory usage. 
q: Number of queues in the Caffe OpenCL backend. 
n: Network minibatch size. 

A: Affinity graph data, A A denotes the affinity graph diff. 

I: Pixel image data, AI denotes the image diff. 

Some symbols are used differently in some sections of the report and are explained 
in-place. 
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Chapter 2 


Datasets 


2.1 DSl - Segmented anisotropic ssTEM dataset of neural 
tissue 



Figure 2.1: DSl ssTEM raw image, 512 by 512 pixels of image 2 (right upper 
corner) (Source: ssTEM 0, 0). 


This data set shows neural tissue from a Drosophila larva ventral nerve cord and 
was acquired using serial section Transmission Electron Microscopy at HHMI Jane- 
lia Research Campus 0. The training data consists of 20 images of 1024 by 1024 
pixels raw ssTEM and the corresponding segmentation. It is segmented into nine 
different labels, which are consolidated into foreground and background for two 
label training and evaluation (see Section [7!^: 
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2.1. DSl - Segmented anisotropic ssTEM dataset of neural tissue 


• #0: Horizontal cell membranes - background 

• #1: +45° to vertical cell membranes - background 

• #2: Vertical cell membranes - background 

• #3: -45° to vertical cell membranes - background 

• #4: Cell membrane junctions - background 

• #5: Glia cells - background 

• #6: Mitochondria - foreground 

• #7: Synapses - background 

• #8: Cell interior - foreground 


The idea behind label consolidation in this way is that the network, during train¬ 
ing, can learn the separation borders between neural cells. This is especially im¬ 
portant with the Malis criterion loss (see Section 5.3.21, which can only segment 
into foreground and background. With the Softmax loss, it is also possible to let 
the network learn the labels separately and combine them accordingly afterwards. 
The network has to learn the features separately either way, as the membrane for 
example depends on different orientations in the convolution filters. This means 
a network can be trained on two labels only and afterwards, all nine labels can be 
extracted by applying a short fine-tuning training phase to the network. 



(a) All 9 labels, the blue mitochondria and (b) Consolidated labels, 

white cell interiors are the foreground. background-foreground segmentation. 


Figure 2.2: DSl ssTEM label images, 512 by 512 pixels of image 2, corresponding 
to the raw image in Figure [2d] (Source: ssTEM 0, 0). 


For evaluation, the training data set has been split into training and testing data 
because of the lack of a segmented test data stack: 

• Train images: 0,1, 3, 4, 5, 6, 8, 9,10,11, 13, 14, 15, 16, 18 and 19 
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2.2. DS2 - ISBI 2012 dataset of neural tissue 


• Test images: 2, 7, 12 and 17 

As all slices of the data set are very similar, cross validation by splitting the data 
set in different ways was not applied. The total amount of pixels for training is 
therefore 16 ■ 1024^ « 16 Mpixel. 

The data set with the corresponding test and train scripts is available in the Caffe 
Neural Models repository 0 as dataset_01. 

2.2 DS2 - ISBI 2012 dataset of neural tissue 



Figure 2.3: DS2 raw image, 512 by 512 pixels of training image 1 (Source: ISBI 

challenge |^, 0). 


This data set is from the ISBI 2012 challenge pT| , p^ , |15|. The raw images and 
corresponding segmentation images for training are 512 by 512 pixels. The neural 
tissue features have a similar scale to the data set DSl. The raw images have a bit 
less contrast and are more fuzzy. 


The training set has 30 images, which gives a total of 30 ■ 512^ 
is about half as much as on DSl. 


7.8 Mpixel, which 


The test data used on DS2 is a separate stack of 30 images of size 512 by 512 pixels. 
For those images, segmentation ground truth is not available for public download. 


thus the evaluation reported in Section 7.3 is solely based on the reports of the 
official ISBI 2012 evaluation pT| , which is still open for new results. 

For both stacks, the data spans 2 x 2 x 1.5 microns with a resolution of 4 by 4 by 


50 nm/pixel |11| 
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2.2. DS2 - ISBI 2012 dataset of neural tissue 



Figure 2.4: DS2 label image, 512 by 512 pixels of training image 1, corresponding 
to the raw image in Figure (Source: ISBI challenge 0). 


The data set with the corresponding test and train scripts is available in the Caffe 
Neural Models repository 0 as datasetD2. 
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Chapter 3 


Models 


3.1 Introduction 


This chapter describes how the network architectures were set up for training 
and processing the data sets DSl and DS2. They were configured for two label 
classification, however also nine labels were tested and even more could be learned 
through the Softmax loss. With Malis loss (Section 5.3.2| , only foreground and 
background separation is implemented. 


For all networks architectures, no special striding or padding was used. Therefore 
the striding parameter is set to be s = 1, except for downsampling and sliding 
window pooling operations, which use s = k (stride matches kernel size). The 
padding is always p = 0. All networks use square size filters and feature maps, 
which simplifies the descriptions to just one value per parameter. The models can 
be generalized to arbitrary dimensions with different sizes in each dimension. At 
the time of the project, SK networks up to 6 dimensional can be configured with 
the modified Caffe library provided |[^. 

The mini batch size used for processing in the SW network is n = 256, but the 
choice is arbitrary and only limited by GPU memory. With SK, U and USK net¬ 
works, a mmibatch size of one (n = 1) is sufficient to reach 100% GPU utilization, 
and often the GPU memory is not sufficient for bigger mini batch sizes (see Section 


if the data set allows it by having big enough input pictures. 


6.41. It is rather useful to increase the network output size than using minibatches. 


All efficient networks presented here (excluding the sliding window) can be run 
with almost any size of output prediction maps. Only constraints on even divisib¬ 
ility with pooling operations, as well as the size constraints given by convolutions 
and strided kernels (see Section |3.3| l have to be met. The networks can therefore 
be run on different input image sizes, depending on memory requirements (see 
Section |6.4| and data set image size, without re-design and re-training of the net¬ 
works. The results are numerically identical in this case. 

The total padding (v) of the networks is a characteristic of the network itself and 
can not be changed without re-design and re-training. It describes the amount of 
context considered for each pixel prediction. 
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3.2. Sliding Window (SW-Net) 


To fit the networks to data sets with features of different scales than DSl and DS2 
(see Chapter |^, it may be necessary to adapt kernel sizes and layers to get good 
predictions after training. Here, the networks are configured so that a big mito¬ 
chondrion (about 100 by 100 pixels) would fit into the context of a pixel prediction 
centered on the mitochondrion. 


3.2 Sliding Window (SW-Net) 


Sliding window networks classify an image by taking a pixel and a border padding 
V of some size around it as input and classify the center pixel by running the patch 
through a neural network. Then the next pixel is labeled by shifting the window 
patch by one pixel, classifying the neighboring pixel of the first one. The pixels 
can also be processed in a minibatch to increase GPU utilization and amortize 
direct memory access transfer times (from host to device memory). This is still 
very inefficient as most of the context of two neighboring pixels overlaps and the 
same filters are applied over the whole context. The redundant computations can 
be reduced for a patch of input pixels by using SK networks. 


The sliding window network described here was developed by Julien Martel |10|. 
It is the baseline for calculating the speedups obtained with the SK, U and USK 
networks. The structure of the SW network was also used when designing the SK 
network and the core of the USK network. The reason why this was used as a 
basis is that it has already been trained on the data set DSl and had good results. 
Learning rate, weight decay and training parameters were available. However, no 
numerical evaluation or publication about the network architecture exists. 


Layer 

Type 

w 

/in 

/out 

k 

s 

data 

MemoryData 

100 

3 

3 

1 

1 

convl + relul 

Convolution + ReLU 

94 

3 

48 

1 

1 

pooll 

Max Pooling 

47 

48 

48 

2 

2 

conv2 + relu2 

Convolution + ReLU 

43 

48 

128 

5 

1 

pool2 

Max Pooling 

22 

128 

128 

2 

2 

conv3 + relu3 

Convolution + ReLU 

20 

128 

192 

3 

1 

pool3 

Max Pooling 

10 

192 

192 

2 

2 

ipl + relu4 

InnerProduct + ReLU 

1 

192 

1024 

10 

1 

ip2 + reluS 

InnerProduct + ReLU 

1 

1024 

512 

1 

1 

ip3 

InnerProduct 

1 

512 

2 

1 

1 

prob 

Softmax 

1 

2 

2 

1 

1 


Table 3.1: Network setup for the SW model. 


The sliding window network has not been evaluated in-depth in terms of bench¬ 
marking and quality assessment. This is because the Caffe Neural Tool (see 
Chapter used for detailed benchmarking and processing does not work with 
minibatches in its current form. Its segmentation performance should however 
be numerically equal to the SK network, which is derived from the SW network. 
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3.3. SK-Net 


Any differences would be due to different training methods (patches on SK versus 
minibatches on SW). Information about how single layers speed up from SW to SK 
networks (both theory and experimental) can be found in the work of Hongsheng 
Li et al. 0. 

For weight initialization, the SW network uses random initialization drawn from 
a Gaussian distribution with ji = 0 and a = 0.01. 


3.3 SK-Net 


3.3.1 Converting SW Networks to SK 

Hongsheng Li et al. 0 provide a pseudo code (page 4) on how to convert a slid¬ 
ing window network to a strided kernel network. However, it is incomplete on 
consistency checking, kernel sizes and feature map output sizes. Also, the theory 
of converting inner product (fully connected) layers is not described. Therefore I 
provide a more complete version (see Algorithm 0, although without considering 
padding and striding. This enforces s = 1 and p = 0 in all layers of the SK net¬ 
work. Each data dimension (width, height, depth) can be processed separately for 
the kernel size (k), kernel stride (d) and output dimension (w). The algorithm is 
able to convert networks and find consistency issues fully automatized. 

For Caffe prototxt network configurations, only the kernel size k and kernel stride 
d have to be provided. Output dimensions will be computed on the fly, given the 
network input size . Padding and striding parameters can be left away, they 
will default to the correct values. 

Algorithm assumes that | \ = \ L^k \ = N, not taking into account the input 

data layer, which is at i = 0. 

If a layer type is not handled in a special if-case, it is handled by using the exact 
same configuration as in the original network. This fails however if the layer 
does anything other than an element-wise operation (this implies k = 1), be¬ 
cause of the kernel stride d > 1. During the conversion, the initial input size 
is equal to what the SW network used. Afterwards, an arbitrary input size 
^SK — ^sw used during training and processing, and the output will be of 

Size = rcgj; — -|- 1. 

Using Wg^ = WgH also helps to prove that the results stay numerically the same. In 
this case, the strides introduced by pooling operations will be implicitly ignored. 
They will not be taken into account at the first inner product layer (ipl), which 
will span the whole feature map size, because the external kernel size is 


w 


(!-i) 

SK 



1)^SK + 1 



(3.1) 


and 

_ 1.(0 _ 1.(0 ^(0 _ 1 

“^SW ~ '^SW ~ '^SK “^SW “ ^ 

at that layer. This property can be visualized as well, as in 

first irmer product layer (ipl) has a kernel size of ksK = 3 = 


(3.2) 

Figure |3.1[ where the 

(9) 
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3.3. SK-Net 


For the originally inner product (fully connected) layers in SW-Net, which are now 
normal convolutions, {ipl to zp3) there are no computational savings compared to 
the SW network anymore. This is clear from the observation that the ipl layer 
isolates the context of each pixel and no overlappings in the feature maps exist 
after this layer. 

The number of input (/in) and output (/out) feature maps remains exactly the same 
for SW and SK networks in all layers. 


Algorithm 1 Convert SW-Net to SK-Net 


1 : 

2 : 

3 : 

4 

5 

6 
7 : 
8 : 
9 : 

10 : 

11 : 

12 : 

13 : 

14 : 

15 : 

16 : 

17 : 

18 : 

19 : 

20 : 

21 : 

22 : 

23 

24 

25 

26 

27 : 

28 : 

29 : 

30 : 


procedure Convert 


(0 

SK 


V/ e [1,N].S, 
V/G[l,N].p« 


1 

0 


zv. 


( 0 ) 

SK 


IV. 


( 0 ) 

SW 


“temp ' r 

for i = 1; i < N; i ^ i + 1 do 
if /g^ = convolution then 


‘SK ^ 
■ 

'^SK 

■ 

“SK 
(0 
SK 


convolution SK 

- 

'^sw 


w. 


itemp 

{i-i) 


w. 


SK 


(^SK 1) ■ 


else if /g^ = pooling then 


(0 

SK 


0 TV. 


(0 

SW 


w. 


F-1) 

SW 


(^SW 1) 


It w. 

‘SK ^ 

'^SK 

“SK 


(0 

SK 


(t-1) 


g^ mod kg^ / 0 V kg^ / Sg^ then return error 
pooling SK 

'^sw 


dO 


TO 


w. 


Hemp 

(i-i) 


w. 


SK 


^temp 


(kg/ 1) ■ d. 

dO 


(0 

SK 


> w. 


(0 

sw 


d'-l) 


'^SW 


else if /g^ = inner product then 

convolution SK 

d'-i) 


^temp * 


‘SK ^ 

k(') ■ 

'^SK 

4'd 


w. 


sw 


>k, 


(0 


SW 


= w. 


sw 


is implicit 


re, 


tem-p 

[i-l) 

SK 


(^SK 1) ■ 


else 


if d 


SK 

itemp ^ 1 

if kg’^ > 1 then return error 

^SK ^ hw 
{i) (j—l) 

WSK ^ ^SK 

= 1 then return success 


ii) 

SK 


> ZV, 


(0 

sw 


> zv, 


(0 

sw 


zv. 


(i-l) 

sw 


temp 

else return error 
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3.3. SK-Net 




Figure 3.1: SK (strided kernel) feature maps. The kernel and feature map sizes 
used here are smaller than in the actual SK network in order to fit it into a 
reasonably sized figure. The input size is + 2 and consequently the 

output is 3 by 3 pixels as inferred from Algorithm The red squares trace the 
information flow in the original SW network. Convolutions and their ReLU 
activations are not displayed separately. 


Observations on Algorithm]^ 

• Line 13: Implies only downsamlplmg pooling layers can be converted. Pool¬ 
ing with s = 1 instead of s = k would need to be handled like convolutions 
in terms of kernel and output size, and would not change the kernel stride 
d. This has not been assessed further. 

• Line 17: Caffe can handle downsampling poolings that overlap the border 
in SW networks, which causes implicit zero padding. This is not allowed in 
SK networks, therefore Line 13 checks if w mod k = 0. Continuing without 
this check would cause the new strided pooling layer to overlap on data that 
would normally be separated by a stride, and also output feature maps of 
wrong sizes to continue. 

• Line 18: Using downsampling is the only operation that increases the kernel 
stride. All other operations either keep the context local (convolutions) or 
have k = 1 (element-wise operations). 

• Lines 11,17,23: Interestingly, all converted layer types with kernel sizes now 
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3.3. SK-Net 


decrease the feature map sizes by the same formula. Convolution and inner 
product layers implicitly inherit this behavior. Pooling does this because it 
separates output pixels by a stride equal to the downsampling kernel size k. 

• Line 29: Having a kernel stride d > 1 in the last layer implies pixels in the 
output feature maps are not independent from each other. In this case, the 
network has not been converted correctly and possibly lacks at least one 
inner product layer. 


3.3.2 SK Network 


3.1 


the condition mod kg^ ^ 0 


When processing the network in Table 
given by Algorithm is actually violated by the second pooling layer, having 43 
mod 2 7^ 0. It is easy to fix this by starting at the last layer of the network and 
computing 

(3.3) 


^('■- 1 ) _ 1.(0 ^(0 

“^sw ~ '^sw'^sw 


for pooling layers and 


re, 


{i-i) 

sw 


= (4w - 1) + 


0 

sw 


(3.4) 


for inner product and convolution layers. The corrected network is given in Table 


3.2 Converting this to SK results in the network in Table 3.3 


Layer 

Type 

w 

/in 

/out 

k 

s 

data 

MemoryData 

102 

3 

3 

1 

1 

convl + relul 

Convolution + ReLU 

96 

3 

48 

1 

1 

pooll 

Max Pooling 

48 

48 

48 

2 

2 

conv2 + relu2 

Convolution + ReLU 

44 

48 

128 

5 

1 

pool2 

Max Pooling 

22 

128 

128 

2 

2 

conv3 + relu3 

Convolution + ReLU 

20 

128 

192 

3 

1 

pool3 

Max Pooling 

10 

192 

192 

2 

2 

ipl + relu4 

InnerProduct + ReLU 

1 

192 

1024 

10 

1 

ip2 + reluS 

InnerProduct + ReLU 

1 

1024 

512 

1 

1 

ip3 

InnerProduct 

1 

512 

2 

1 

1 

prob 

Softmax 

1 

2 

2 

1 

1 


Table 3.2: Network setup for the corrected SW model. 
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3.3. SK-Net 


Layer 

Type 

zv 

/in 

/out 

k 

s 

d 

data 

MemoryData 

229 

3 

3 

1 

1 

1 

convl + relul 

Conv. SK + ReLU 

223 

3 

48 

7 

1 

1 

pooll 

Max Pool. SK 

222 

48 

48 

2 

1 

1 

conv2 + relu2 

Conv. SK + ReLU 

214 

48 

128 

5 

1 

2 

pool2 

Max Pool. SK 

212 

128 

128 

2 

1 

2 

conv3 + relu3 

Conv. SK + ReLU 

204 

128 

192 

3 

1 

4 

pool3 

Max Pool. SK 

200 

192 

192 

2 

1 

4 

ipl + relu4 

Conv. SK + ReLU 

128 

192 

1024 

10 

1 

8 

ip2 + relu5 

Conv. SK + ReLU 

128 

1024 

512 

1 

1 

1 

ip3 

Conv. SK 

128 

512 

2 

1 

1 

1 

prob 

Softmax 

128 

2 

2 

1 

1 

1 


Table 3.3: SK network configuration. 


The final three inner product layers from the SW network actually become convo¬ 
lutions with special properties: For ipl, the rules stay the same as for converted 
convolution layers. Afterwards, ipl and ip3 can have an arbitrary kernel stride d 
because the kernel size k only spans one pixel in each feature map. To not cause 
confusion, it should be configured so that d = 1 for those layers. 

The network still has one issue, which is not nice but acceptable and the network 
will still work. The issue is that there is no center pixel, because 

(a;(^=102) mod 2 = 0 (3.5) 


Each patch in the original network has a context of 102 pixels. When converting to 
SK and classifying a patch of 128 by 128 pixels as given in Table 3.3 the padding 
to add in the beginning actually becomes v = 101 pixels. This padding can not be 
split up into a border around the patch to classify. To simplify this issue, a border 
of 51 pixels on each side is assumed and then cropped by one pixel on the bottom 
and right side. This results the same behavior as running a 102 by 102 pixel sliding 
window network across the input, which was also padded with 51 pixels in the 
corrected version and 50 pixels in the original version. 


To estimate the number of free parameters (all convolution weights |W|) that can 
be trained in a model, the following formula is used: 


|W|= E /ta '/Si (3-6) 

/^'ALconv. 


3.3 


Using the values in Table 
(?a 19.6 ■ 10^) are within the ipl layer. 


this gives | W| « 20.5 ■ 10^ parameters, of which most 


For weight initialization, the SK network uses random initialization drawn from a 
Gaussian distribution with p = 0 and cr = 0.01. 
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3.4. U-Net 
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Figure 3.2: SK network configuration visualization. Green-red striped blocks 
represent feature maps with a kernel stride {d > 1). Vertical numbers represent 
the size of the feature maps while the horizontal numbers represent the number 

of feature maps. 


A directed acyclic graph representation of the network can be found in the ap¬ 
pendix |A.1[ 


3.4 U-Net 


The U-Net presented here is the network configuration as described in the Ron- 
neberger et al. paper Q. Table 3.4 describes the network in the same style as Table 


3.3 for the SK network in order to compare them. The layer names are chosen 


in the same style as with SW and SK networks. The U-Net has contracting and 
expanding sections: 


• Contracting: Two convolutions followed by one max pooling layer. 

• Expanding: Deconvolution followed by a convolution to reduce the number 
of feature maps, a mergecrop and two convoluton layers. 

The source code for running U-Net as well as the prototxt configuration files were 
not available for download at the time of this project, thus the network presented 
here, which is my own interpretation, might differ from the original design. The 
paper does not give all details, such as how the MergeCrop layer and Upconvolution 
work. The configurations of this U-Net included in the Caffe Neural Models Q 
are therefore incompatible to the original work |Q. Segmentation results should 
be comparable. 


Layer 

Type 

w 

/in 

/out 

k 

s 

data 

MemoryData 

572 

3 

3 

1 

1 

convl + relul 

Convolution + ReLU 

570 

3 

64 

3 

1 

conv2 + relu2 

Convolution + ReLU 

568 

64 

64 

3 

1 

pooll 

Max Pooling 

284 

64 

64 

2 

2 

conv3 + relu3 

Convolution + ReLU 

282 

64 

128 

3 

1 
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3.4. U-Net 


conv4 + relu4 

Convolution + ReLU 

280 

128 

128 

3 

1 

pool2 

Max Pooling 

140 

128 

128 

2 

2 

convS + reluS 

Convolution + ReLU 

138 

128 

256 

3 

1 

conv6 + relu6 

Convolution + ReLU 

136 

256 

256 

3 

1 

pool3 

Max Pooling 

68 

256 

256 

2 

2 

conv7 + relu7 

Convolution + ReLU 

66 

256 

512 

3 

1 

convS + reluS 

Convolution + ReLU 

64 

512 

512 

3 

1 

pool4 

Max Pooling 

32 

512 

512 

2 

2 

conv9 + relu9 

Convolution + ReLU 

30 

512 

1024 

3 

1 

convlO + relulO 

Convolution + ReLU 

28 

1024 

1024 

3 

1 

upconvl 

Deconvolution 

56 

1024 

1024 

2 

2 

convll 

Convolution 

56 

1024 

512 

1 

1 

mergecropl 

MergeCrop 

56 

512 + 512 

1024 

1 

1 

convl2 + relull 

Convolution + ReLU 

54 

1024 

512 

3 

1 

convl3 + relul2 

Convolution + ReLU 

52 

512 

512 

3 

1 

upconv2 

Deconvolution 

104 

512 

512 

2 

2 

convl4 

Convolution 

104 

512 

256 

1 

1 

mergecrop2 

MergeCrop 

104 

256 + 256 

512 

1 

1 

convlS + relul3 

Convolution + ReLU 

102 

512 

256 

3 

1 

convl6 + relul4 

Convolution + ReLU 

100 

256 

256 

3 

1 

upconv3 

Deconvolution 

200 

256 

256 

2 

2 

convl7 

Convolution 

200 

256 

128 

1 

1 

mergecrop3 

MergeCrop 

200 

128 + 128 

256 

1 

1 

convlS + relulS 

Convolution + ReLU 

198 

256 

128 

3 

1 

convl9 + relul6 

Convolution + ReLU 

196 

128 

128 

3 

1 

upconv4 

Deconvolution 

392 

128 

128 

2 

2 

conv20 

Convolution 

392 

128 

64 

1 

1 

mergecrop4 

MergeCrop 

392 

64 + 64 

128 

1 

1 

conv21 + relul7 

Convolution + ReLU 

390 

128 

64 

3 

1 

conv22 + relulS 

Convolution + ReLU 

388 

64 

64 

3 

1 

ipl 

Convolution 

388 

64 

2 

1 

1 

prob 

Softmax 

388 

2 

2 

1 

1 


Table 3.4: U network configuration. 


The U-Net architecture has |W| « 29 ■ 10“ parameters, using Equation 3.6 and 


the corresponding values in Table |3.4| Thus there are about 30% more parameters 
than in SK-Net. The number of weights rises towards the middle of the U network, 
due to having the same convolution kernel size (3 by 3) throughout the network 
and more feature maps with every contracting step. 


For weight initialization, the U network uses random initialization drawn from a 
Gaussian distribution with ji = 0 and a = y/l! (/in • k^). This is the same as used 
in the original paper Q. With a = 0.01 as in SK and SW models, the network 
could not be trained properly. 
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3.5. USK-Net 


The upconvolutions are set to nearest neighbor interpolation, which causes each 
pixel in the /in maps to fill exactly four pixels (two by two) in the /out maps. 
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Figure 3.3: U network configuration visualization. Vertical numbers represent the 
size of the feature maps while the horizontal numbers represent the number of 

feature maps. 


Figure |3.3|r epresents the network in a style identical to the one used by Ronneber- 
ger et al. 0. A directed acyclic graph representation of the network can be found 
in the appendix |A.2[ 

3.5 USK-Net 

The USK network architecture combines ideas from the U and SK models. The ma¬ 
jority of the convolutions and therefore free parameters can be trained on down- 
sampled feature maps by using one or more contracting path steps (and their 
expanding counterpart) from the U-Net. 

After one contracting step, the same sequence of layers as in the SK network is 
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3.5. USK-Net 


applied, however with different kernel sizes to match the sizes of the inputs and 
outputs as required by the U subnetwork. The network has been configured so 
that: 


• The network outputs 512 by 512 pixel labels. 

• The context considered for each pixel classification is 180 by 180 pixels. 

• Asa result, the network input is 692 by 692 pixels. 

• The SK network part {conv3 to ip2) is required to accept 344 by 344 pixels as 
input and outputs 258 by 258 pixels. It therefore sees a context of 86 pixels. 
This is on once by a factor of two downsampled feature maps. 


Prior experiments with two contracting and expanding steps were not very suc¬ 
cessful, as many features of both datasets (DSl and DS2) vanished at two times 
downsampling and the network also became much harder to train because it gets 
very deep. The network mainly relied on the filter results in the two contracting 
and expanding step pairs (U subnet) to classify the given input. The signals com¬ 
ing from the SK subnetwork were largely ignored by setting their weights close to 
zero in the convolution following after the first mergecrop layer. 

Using only one downsampling step fixed this issue and the SK subnetwork con¬ 
tributed properly after training. As the SK subnetwork contains the main com¬ 
putational costs (see Figures 6.9 and 6.10 in Section 6.6.3| and also carries most 
parameters, the features have to be meaningful enough at the beginning of the 
subnetwork, after the first few layers of the U subnetwork. 


Layer 

Type 

w 

/in 

/out 

k 

s 

d 

Subnet 

data 

MemoryData 

692 

3 

3 

1 

1 

1 

U 

convl + relul 

Conv. + ReLU 

690 

3 

64 

3 

1 

1 

U 

conv2 + relu2 

Conv. + ReLU 

688 

64 

64 

3 

1 

1 

u 

pooll 

Max Pooling 

344 

64 

64 

2 

2 

1 

u 

conv3 + relu3 

Conv. SK + ReLU 

339 

64 

128 

6 

1 

1 

SK 

pool2 

Max Pool. SK 

338 

128 

128 

2 

1 

1 

SK 

conv4 + relu4 

Conv. SK + ReLU 

332 

128 

128 

4 

1 

2 

SK 

pool3 

Max Pool. SK 

330 

128 

128 

2 

1 

2 

SK 

conv5 + relu5 

Conv. SK + ReLU 

318 

128 

128 

4 

1 

4 

SK 

pool4 

Max Pool. SK 

314 

128 

128 

2 

1 

4 

SK 

ipl + relu6 

Conv. SK + ReLU 

258 

128 

512 

8 

1 

8 

SK 

ip2 + reluZ 

Conv. SK + ReLU 

258 

512 

256 

1 

1 

1 

SK 

upconvl 

Deconv. 

516 

256 

256 

2 

2 

1 

U 

conv6 

Conv. 

516 

256 

128 

1 

1 

1 

u 

mergecrop 1 

MergeCrop 

516 

128 + 64 

192 

1 

1 

1 

u 

convZ + relu8 

Conv. + ReLU 

514 

192 

128 

3 

1 

1 

u 

conv8 + relu9 

Conv. + ReLU 

512 

128 

64 

3 

1 

1 

u 

ip3 

Conv. 

512 

64 

2 

1 

1 

1 

u 

prob 

Softmax 

512 

2 

2 

1 

1 

1 

u 


Table 3.5: USK network configuration. 
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3.5. USK-Net 


The USK-Net architecture has |W| 


the corresponding values in Table 3.5 


5.5 • 10^ parameters, using Equation |3.6| and 
This is a fraction (about 25%) of the SK- 


and U-Net weights. The savings mainly come from reducing the ipl layer, which 
now only has ~ 4.2 • 10^ weights. While the ipl layer is still the most expensive 
one, the network is more balanced than SK-Net. The inner product layers are 
less important, because the U subnetwork merges and convolves the feature maps 
from the beginning of the network together with upsampled signals from the ipl 
layer. Figure |3^ displays the balanced feature maps on the two processing paths. 
Furthermore, the USK-Net inherits the advantage of having bigger kernel sizes 
than only 3 by 3 (U-Net), going up to 8 by 8 in the ipl and 6 by 6 in the convl layer. 
This means the USK-Net can learn features with looking at a bigger context inside 
the feature maps, while still almost reaching the speed of U-Net (see Section |6.5| 
in forward processing. 
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Figure 3.4: USK network configuration visualization. Green-red striped blocks 
represent feature maps with a kernel stride {d > 1). The U subnetwork blocks are 
just displayed red because the kernel stride does not apply there. Vertical 
numbers represent the size of the feature maps while the horizontal numbers 
represent the number of feature maps. 
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3.5. USK-Net 


With less weights and less layers (less depth) than the U-Net, it is easier to train 
and also gave better segmentation results on the two evaluation data sets (see 
Chapter 0. 

The USK network uses random Gaussian weight initialization with /r = 0 and v = 
0.01 for the SK subnet and (7 = y^2/ (/in • k^) for the U subnet. Other initializations, 
such as (7 = 0.01 for the whole network caused the network to either disable 
many neurons (causing feature maps with zero activation) or stopped the loss 
from decreasing early during training. 

A directed acyclic graph representation of the network can be found in the ap¬ 
pendix |A.3[ 
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Chapter 4 


Caffe Neural Tool 


4.1 Functionality 

As the standard Caffe binary |[^ does not support training with patches, a new 
interface had to be written on top of the Caffe library. One option is to use the 
Pycaffe interface with custom python code to load images and pre-process them 
for patch training and processing as depicted in Figure 

I chose to implement a C++ interface similar to the original binary, because OpenCV 
and OpenMP can be used for efficient, parallelized algorithms to preprocess raw 
and label images. 

The Caffe Neural Tool can be configured for training, processing and bench¬ 
marking with a prototxt file similar to the configuration files used to set up net¬ 
works and solvers (learning configurations) in Caffe. The two most important 
functionalities are histogram equalization during training and the image prepro¬ 
cessor, which can prepare the label and raw images in various ways before filling 
up the neural network. 

Template configurations for training and processing on the data sets DSl and DS2 
as well as benchmark scripts for U-, SK- and USK-Net are available in the Caffe 
Neural Models repository 0. 

For training, the following parameters have to be provided: 

• A Caffe solver prototxt configuration which contains parameters such as 
learning rate, weight decay, solver method and the network to use for train¬ 
ing. 

• The padding size (z?), network output (patch) size (w), network input (/in) 
and output (/out) feature map count. 

• A folder with raw images and a folder with the corresponding label images, 
which are matched by alphabetic order. 

• The preprocessor configuration block, including histogram equalization set¬ 
tings. 


1.2 
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4.2. Preprocessing 


• Optionally, a solverstate file to resume a training. The already learned 
weights and current learning rate will be loaded instead of the initial net¬ 
work configuration. 

For processing, the following parameters have to be provided: 

• A Caffe network prototxt configuration to use for processing. 

• A caffemodel file, containing the trained network weights. 

• A folder with raw images and the segmentation output options (file type, 
pixel format and folder). 

• The padding size (v), network output (patch) size (w), network input (/in) 
and output (/out) feature map count. 

• The preprocessor configuration block (without histogram equalization). 

• Optionally, all memory blobs in the network can be stored during processing. 
This feature is called/7fer out-put and is useful to check if the network learns 
the right convolution filters. 

The tool also supports a variety of input and output image formats: Normal JPEG, 
PNG, TIP and BMP files as well as TIP image stacks (multiple images in one file) 
and 32 bit floating point TIP instead of integer pixel values. 

When benchmarking, the tool re-uses existing training and processing configura¬ 
tions, but only fills the network with random data. It will report memory usage, 
layer wise forward and backward times and the total processing time of the net¬ 
work. Por convolution layers, it will also estimate and store the computational 
complexity. This was used to generate the results in Ghapter 


4.2 Preprocessing 


The preprocessing options available are: 


Pabel consolidation, allowing to combine multiple labels into one. This tech¬ 
nique was used on the data set DSl (see Section 2.11. The consolidation is 
applied after histogram equalization, allowing to balance out difficult and 
rare labels before consolidating to only background and foreground. This 
also allows to mark important small, difficult features in the training data. 

Rotation of the training patches to a random multiple of 90°. 


• Random mirroring of the training patches. 

• Blurring training patches with a Gaussian kernel of any size. The blurring 
has a zero mean and the variance is picked at random from a normal distri¬ 
bution. The mean and variance of the distribution can be selected. 


• GPAHE (contrast limited adaptive histogram equalization), with a clipping 
parameter. The function is integrated with OpenGV. 

• Patch normalization to [—1.0,1.0] in floating point before feeding the neural 
network. 


25 




4.3. Histogram Equalization 


There are a few reasons why arbitrary rotation is not available: 

• Interpolating to any angle that is not a multiple of 90° causes aliasing of 
training patches (labels and raw images). 

• Arbitrary rotations make the training patch smaller due to corner cutting. 
This means there are less available total patches and the computation of 
histogram equalization methods gets much more complicated. The assump¬ 
tions about the patch prior and label posterior distributions do not hold 
anymore. 

• It is questionable how useful the additional training data would be. Elastic 
deformations would have more potential in generating unique new training 
data for biological images such as neural tissue EM images of DSl and DS2. 


4.3 Histogram Equalization 

Histogram equalization is a technique to balance out the frequency of labels that 
the network sees during training. As training with patches leads to a set of de¬ 
pendent pixels which are in close proximity on an image (see Eigure |1.2| , the 
training results can be worse than with minibatches. With minibatches (see Eigure 


draw n independent pixels to train with in each stochastic gradient descent step. 

The first equalization approach is a patch prior, which will prefer patches with 
rare labels. This is done by comparing the label distribution within each patch to 
the total label frequency in all training images. 


1.11, it is possible to create a database of single pixel labels and the minibatch can 


n-l Q. . 

!=0 

(4.1) 

m—1 


E n ■ 

(4.2) 

HO 



Eor patches j = 0,... ,m — 1 and labels f = 0,...,n — 1, Equation |4. 1 1 calculates the 
weight for each patch (ry) based on the total label distribution c, and the frequency 
within each patch (fly,,)- Equation 4.2 calculates the label posterior distribution c. 


based on the patch weights ry and the label frequency within the patch. Z, is a 


normalization factor to get c, = 1. 

The method does help if there are patches that have rare labels, because those 
patches will be drawn at random with a higher probability than others. An ex¬ 
ample is the synapse label (number 7) (see Eigure 4.1 and Table |4.1| |. It cannot 
balance the labels which have a similar distribution in every patch - for example 
cell membranes versus cell interior. 


When the patch size gets bigger and approaches the size of the training images, the 
label distribution after the patch prior approaches the original label distribution. 
Thus the patch prior only works with small training patches. It can also completely 


26 






4.3. Histogram Equalization 


equalize the histogram when using a single pixel as patch size {w = 1), as this is 
the same situation as with independent pixels. 

When calculating label frequencies C;, it is taken into account that pixel labels 
closer to the border of the image are covered by less patches than those in the 
center of the image. A patch can start and end at any offset inside the image. 
Corner pixels for example are only covered by the one patch that starts in that 
corner. 

After applying the patch prior, the new label distribution c, is depicted in Figure 
14.11 with values from Table l4d1 


Method 

0 

1 

2 

3 

4 

5 

6 

7 

8 

None (C;) (%) 

2.5 

3.0 

3.5 

3.3 

5.1 

3.0 

5.5 

0.6 

73.6 

Patch prior (c,) (%) 

2.9 

3.4 

4.0 

3.7 

6.1 

6.3 

7.3 

1.9 

64.3 

Masking (%) 

11.1 

11.1 

11.1 

11.1 

11.1 

11.1 

11.1 

11.1 

11.1 


Table 4.1: Label frequency using different histogram equalization techniques. 



(a) No equalization (c/). The majority of 
labels is underrepresented. 


(b) Patch prior (c,) on 64 by 64 pixel patches 
{w = 64). 


Figure 4.1: Label frequency using the patch prior. Label 8, the cell interior, gets 
slightly reduced while especially glia cells (5), mitochondria (6) and synapses (7) 
are represented stronger. Label numbers correspond to DSl (see Section [ZT||. 


The second method masks pixels randomly in each patch, where the random func¬ 
tion is thresholded by the inverse frequency of the label. This means a pixel of 
label type i gets masked (removed) from the error map if: 

• (minc,)“^ < p (4.3) 

i 

In Equation |4.3[ p G [0,1] is a random value picked at uniform and C; G [0,1] the 
label frequency for label i = 0,...,n — 1. Less frequent labels are less likely to 
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4.3. Histogram Equalization 


be masked out, while membrane and cell interior labels, which are very common, 
get masked with a high probability. The result is a completely balanced label 
histogram (see Table |4.l| . The least frequent label consequently never gets masked. 

The best solution to train on an exact label distribution as desired remains to do 
minibatch training. This is theoretically also possible with SK-, USK- and U-Net, 
but very inefficient in terms of training speed. 

Minibatch training is the standard for most networks, also for SW-Net. The net¬ 
work weights are updated every time after a patch or minibatch has run through 
forwarding and backwarding. The gradients get accumulated over all error pixels 
in the Softmax loss layer and are then normalized for stable training. 


Having a balanced label distribution is mostly important with Softmax and multi 
label classification. When doing background-foreground separation with Malis, 
the patch prior has little to no effect and masking can not be used at all. Malis does 
already focus the error on problematic zones globally over the whole patch that 
runs through the network (see Section 5.3.2| and therefore the label distribution 
during training does not matter. 
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Chapter 5 


Caffe Library 


5.1 Introduction 


Adapting the Caffe library for efficient pixelwise classification on heterogeneous 
hardware contains the most programming work in the scope of the project, chan¬ 
ging 20'000 lines of code compared to the BVLC master branch ||^, ||^. 


The changes can be grouped into adaptions on different levels: 

• Modified solver and network code to support the Caffe Neural Tool C++ 
interface. 

• Modified existing layers to fit the pixelwise inputs and outputs. 

• Additional layers for SK, U and USK architectures. 

• Additional layers for the Mails loss. 

• Redesign of N-dimensional layers to support up to 6D convolutions and max 
pooling with strided kernels. 

• OpenCL and OpenCL hybrid code for supporting a wide range of GPUs and 
CPUs. 

• Backend adaptions to allow dynamic backend dispatching at run- and compile¬ 
time. 

• Adapted GNU Makefile and GMake build infrastructures to support the new 
OpenGL backend. 

The changes are implemented in a way that does not break backward compability 
to the original library. All existing network models and trained networks can still 
be used. The source code remains highly maintainable and was ahead of the Gaffe 
BVLG branch |j^ during the whole scope of the project. 
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5.2 Modified Layers 

5.2.1 SK Layers 

SK (strided kernel) layers are layers with a kernel size k > 1 and an inner stride in 
the kernel {d > 1). The result is a kernel that looks at a feature map in a context 
of (k — 1) • d + 1 pixels, which is also called the external kernel size. 

The motivation to have such kernels is to be able to convert single pixel predic¬ 
tion networks (sliding window (SW) networks), to patch prediction networks (SK). 
How this works is visible in the Figures |1.1| and |1.2} as well as the strided repres¬ 
entation in Figure |3.1[ 

For SK convolutions, the matrix multiplication (see Section |5.5| stays the same as 
with normal convolutions. Only the imlcol and col2im memory copy kernels have 
to be adapted. For this, I used the existing kernel codes provided by Hongsheng 
Li et al. 0. 

Changing existing imicol and col2im kernels to support d > 1 is trivial: 

The kernels just read the data by iterating over all input feature maps and dimen¬ 
sions. Within each dimension, the iteration goes over the kernel size k, copying 
data into the convolution buffer. 

With strided kernels, when iterating over the kernel, the reading pointer has to be 
increased by the dimension stride multiplied by the kernel stride d instead of just 
adding the dimension stride. The dimension stride is , starting at; = 0 

for the first dimension and denoting the input feature map size. 

The same iteration scheme applies for pooling operation kernels. All other layers 
do not have to be adapted and work together with SK layers as long as they have 
a kernel size of k = 1 (see Algorithm]^. 

Strided kernels are only implemented in CUDA and OpenCL and do not run as 
native CPU code. 


5.2.2 N-Dimensional Layers 


The Caffe library is able to specify an arbitrary amount of dimensions for the blob 
memory infrastructure used to pass data between layers. However, not all layers 
automatically work in higher dimensions. For most element-wise kernels, nothing 
has to be changed. Convolutions and pooling operations require a slight redesign. 


For convolutions, the matrix multiplication stays again the same as with normal 
and strided kernel convolutions (see Section 5.51. There were existing kernels for 
normal N-dimensional convolutions by Jeff Donahue 1121. However, those kernels 
only support the default kernel stride d = 1. 


In the scope of this research project, at HHMI Janelia, I combined the existing code 
of the strided kernel and N dimensional convolutions to get the most generalized 
form of convolutions, supporting up to 6 dimensions and kernel strides. The 6 di¬ 
mension limit exists because allocating local arrays of dynamic sizes is not allowed 
in OpenCL and CUDA. The arrays are required to store temporary variables such 
as iterators for each dimension. 
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5.3. New Layers 


Derived from the convolution code, I also implemented the max pooling function 
as ND-SK kernel. 

In the case of ID or 2D, the SK and normal convolution layers should be con¬ 
sidered, as looping over two dimensions is more efficient than having an outer 
loop over dimensions and an inner loop which processes one dimension at a time. 
The ND layers should be used from 3D to 6D. 

This report does not analyze any networks with a dimension higher than two, but 
it is possible to replace ConvolutionSK layers by ConvolutionND layers in the SK- 
Net to get an arbitrary dimensional network. It may be required to reduce the 
network output size and feature map count in order to meet memory constraints. 

N-dimensional kernels are only implemented in CUDA and OpenCL and do not 
run as native CPU code. 


5.3 New Layers 


5.3.1 Merge Crop 


The MergeCrop layer is required in U- and USK-Net architectures (see Sections 3.4 
and 3.51. The layer accepts two input blobs: 

• Blob A of size 


• Blob B of size 

The layer outputs a blob containing all feature maps of A and B, which can have a 
different amount of feature maps. Input B has to be cropped to the size of A. The 
output feature maps are of size 


During backpropagation, the error maps are propagated through by copying them 
in the inverse direction. For U- and USK-Net, backwarding is only enabled for the 
input A, as input B gets the differential data on a different path (from the down 
sampling pooling layer) in the neural network. Copying back B would overwrite 
the gradients and interfere with the intended training. 


MergeCrop is only implemented in CUDA and OpenCL and does not run as native 
CPU code. 


5.3.2 Mails Loss 


MALIS stands for maximum affinity learning of image segmentation. The imple¬ 
mentation in Caffe Q that I provide is based on existing code to compute the 
Malis criterion for Matlab 1171 and Torch by Srinivas Turaga et al. [j^. 


Additional layer forward and backward functions (for interaction with Caffe), 
memory management and the two additional layers ConnectedComponent and Af¬ 
finity are new contributions. The layers are only implemented as CPU code and 
do not run on OpenCL or CUDA. 


Figure [5T] describes how the Malis criterion loss is used together with the network 
models presented in this report (see Chapter]^. In place of the Softmax activation 
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5.3. New Layers 


for two labels, a rectified linear unit or other activation could also be used. The 
split layer is required to feed the ground truth label blob into the two following 
components and affinity layers. 

The layer structure is separable so that the Mails loss can be used with feeding in 
external connected component and label affinity maps instead of computing them 
indirectly. Likewise, the neural network can directly learn affinity graphs instead 
of pixelwise labels. 



Figure 5.1: Mails loss setup DAG. In place of the single ip convolution layer, there 
would be a whole neural network (see Appendix]^. 


As the Mails criterion is calculating minimum spanning trees between pixel pairs 
on affinity graphs and is only selecting the minimum edge of the tree as error 
edge, if the position is indeed an error in the prediction, the actual Mails function 
is called twice on two affinity maps A+ and . 
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5.3. New Layers 


Having the predicted affinity graph Ap^ed {=proh Jiffinity) and the ground truth to 
it (Ajabei {=lahelJiffinity)), two new affinity maps are generated: 

A = rnin(Ap].gd, Ajaijgj) (5.1) 

A = iTiax(Api.gd, Ajabei) (5-2) 


Then, A+ will not contain any errors on the background prediction while A^ 
has no errors in the foreground. The result is that the Malis criterion is able to 
isolate errors in the background prediction (membranes with gaps) on A^ and find 
errors on the cell interior with A+. The resulting error map for backpropagation 
is AA = AA^ + AA+. 


In Equations |5.1| and 5.2 the affinity maps A denote the combination of vertical 
(A^y)) and horizontal (Al^^) affinity maps. The affinity graphs contain one value 
per edge between two neighboring pixels. A value close to 1 stands for high 
affinity while 0 means unconnected. 


Details on how the Malis criterion works internally can be found in the original 
paper on Malis by Srinivas Turaga et al. |j^. The implementation is quite efficient 
and does not contribute massively to the training time. 



(a) Background prediction map {1 ). (b) Foreground prediction map (1^). 

Figure 5.2: Network predictions after a few training steps with Malis loss 
(USK-Net, Section [33] |. The network can not separate membranes and 
mitochondria yet. The images correspond to the data (forward) component of the 

memory blob proh in Figure |5.1[ 


The error map Af+, which is derived from AA using the affinity layer during 
backpropagation, is depicted in Figure |5.3[ On a first look, it seems like only cell 
interior errors, the light gray areas, exist in the picture. 

When enhancing the contrast and marking errors in the membrane with red spots, 
it becomes more clear what happens: The Malis criterion calculates minimum 
spanning trees between pixels of the cell interior of two merged cells that should 
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5.3. New Layers 


be separated. The edge that will be corrected is almost always the same between 
two cells for all pixel pairs within those cells, due to shared edges in the spanning 
tree. This edge will therefore accumulate a high error value in AA“, but there are 
only a few such spots in every training patch. 

On the cell interior errors (AA+), the edges to be corrected occur on the border 
of cell interior areas that are mislabeled as cell membrane. This leads to a denser 
distribution of the error, but less error intensity per edge. 



(a) Original error map (AI+). Membrane (b) Contrast enhanced error map. Red 
errors are sparse and barely visible. marks denote membrane errors. 


Figure 5.3: The error map generated by the Malis loss. The image corresponds to 
the diff (backward) component of the memory blob prob in Figure |5.1[ The gray 
background color is zero (no) error, while black spots are negative errors from 
AA^ and light gray areas are positive errors from AA+. 


Malis loss and the U/USK-Net models match perfectly: As the networks can out¬ 
put very large patches up to 512 by 512 pixels without reaching memory limits of 
current GPUs (see Section |6.4| , cormected components and the Malis loss have a 
very large context to work on. 

As soon as the patch size is so small that it barely covers a cell, Malis becomes 
useless. With sliding window networks for example, it would be necessary to 
compute many forwarding iterations first before being able to run Malis and gen¬ 
erate an error map to backpropagate. It would also be necessary to retain all blobs 
of all forwarding iterations. The whole process would not be efficient in terms of 
memory consumption and training times. 

5.3.3 Affinity 

The affinity layer is an additional layer that has to be used in conjunction with 
Malis loss (see Figure [5T] |. 

The idea is that not only affinity graphs but also pixelwise classifications can be 
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5.3. New Layers 


learned with the Malis loss. During forwarding, the affinity layer has to look 
at neighboring pixels in the horizontal and vertical direction of the input and 
compute their connectivity (affinity). 


(A _ : r+ 

“ zi?,x+i ^ 

(5.3) 

S = min 1+ 

z=y,y+l 

(5.4) 

£y = arg min 4+ 

z=x,x+l 

(5.5) 

-S = arg min J+^ 

(5.6) 


z=M+l 


In Equations |5.3| to [ 5 !^ is the horizontal oriented affinity map, the vertical 

one. The affinity maps have the same size as the input image J. Additionally, the 
minimum index maps and M^y'> have to be stored so that the loss can be 
distribute accordingly during backpropagation. 


The Softmax layer (blob proh) in Figure [5d] and the ground truth labels (blob label Ji) 
actually store both the foreground and background (see Figure 5.21. For the image 
I used to produce the affinity map, only the foreground prediction map (J+) is 
considered because it stores 1 for foreground (connected) and 0 for background 
(disconnected). The resulting affinity graph will correctly have higher values for 
connected pixels than discormected ones. 


The affinity graph is computed twice: Once for the ground truth (blob label Jijfinity 
in Figure [5d] | and once for the current network prediction (blob probjiffinity). 

During backpropagation, the affinity loss has to be attributed to single pixels again, 
as the Malis criterion will attribute the error map to an affinity graph. Now, both 
the foregound (Af+) and background (Af^) have to get an error map to balance 
out the Softmax function. The loss is attributed symmetrically, AJ+ = —AI^. 


Initialization: 


= Air , 

= 0 

(5.7) 



(5.8) 


+ = A/lg 

(5.9) 


- = 

(5.10) 


+ = aa“ 

(5.11) 


- = AAg 

(5.12) 
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5.4. OpenCL Backend 


Equations |5.8| to |5.12 describe how to attribute the affinity loss back to pixel loss, 
given the minimum index maps and computed in the forward pro¬ 
cessing step. 


There are also other ways to compute an estimation to an affinity graph, such 
as averaging neighboring pixels. The choice for the minimum selection worked 
particularly well because there is no loss of resolution or aliasing when computing 
it this way. The Malis criterion selects the minimum edge of the affinity graph for 
creating the loss maps and AA^y\ Thus using the minimum valued pixel 

through and for attributing the pixel loss makes sense. Both objectives 
minimize the same error by either increasing or decreasing the affinity of the 
neighboring pixels. 


5.3.4 Connected Components 

The connected components layer is a small layer based on the OpenCV flood¬ 
filling algorithm and outputs separated connected components from a foreground- 
background labeled ground truth (Figure [5^ . Based on this map, the Malis loss 
knows which areas have to be separated and which are cormected. Here, the cell 
membrane, which is considered background, is not assigned to any component. 



Figure 5.4: Cormected components belonging to Figure The feature map 
corresponds to the memory blob component in Figure |5.1[ 


5.4 OpenCL Backend 


5.4.1 Implementation 

In my version of the Caffe library, an additional versatile backend for various com¬ 


pute devices, based on OpenCF and ViermaCF 1191, is available. The backend is 
called Greentea and is part of the Project Greentea consisting of frontend, models 
and modified Caffe library (see Figure [T3] |. In this section, an overview of interest¬ 
ing aspects how the backend works and how the Caffe library had to be changed 
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5.4. OpenCL Backend 


is given. Further details and a full documentation is available within the source 
code, which is available for download [|^. 

A key feature is that the OpenCL backend is feature equivalent to the CUDA 
backend. All GPU layers can be used on both backends. The OpenCL backend 
is also unit test verified and passes all test cases of the original Caffe library. The 
tests can be invoked by executing "make runtest" on the source code folder. 

It remains possible to compile the library with support for all backends at once. 
The compute kernel and BLAS calls can be dispatched dynamically at runtime, 
depending on what kind of device is selected. Every device available is registered 
in a new DeviceContext object that stores the device and backend type. 

The following aspects of the library had to be changed: 

• The Caffe library enumerates all devices on all enabled backends, starting 
with CUDA devices. The selected GPU number determines which Device- 
Context is set as the default. 


• The SyncedMem class that is used to manage the device memory can now 
store either a CUDA GPU pointer or an OpenCL cLmem memory object, 
depending on which device and backend the memory object belongs to. 

• The Forward^gpu and Backward_gpu functions now contain both OpenCL and 
CUDA code to call compute kernels and BLAS functions. 

• Network layers, SyncedMem and blob objects carry a pointer to a DeviceCon¬ 
text, which allows to express neural network object to device relationships. 
This is a feature for allowing future multi-device networks, where keeping 
track of which memory blob is on which device is essential. 

• All CUDA compute kernels are translated to OpenCL code. This is trivial 
for the most part, as the syntax is very similar. 

• The OpenCL backend can dispatch BLAS calls to ViennaCL-BLAS or clBLAS 
on GPUs. ViennaCL-BLAS is header-only and therefore easier to use, while 
clBLAS is optimized for certain AMD GPUs but has to be compiled separ¬ 
ately. 


On CPU devices, the Greentea backend mixes native CPU code with OpenCL 
code to achieve an optimal performance (see Section 5.4.2|. 


5.4.2 OpenCL Hybrid 

The OpenCL hybrid implementation describes how the OpenCL backend is used 
when selecting a CPU device instead of a GPU device. The two fundamental 
differences are memory allocation and BLAS library calls. 

When a SyncedMem object is instantiated, the memory is allocated as host memory 
rather than device memory. Differently than with the CPU backend, the memory 
is allocated through OpenCL. This allows to access the underlying memory pointer 
of OpenCL memory objects while also being able to use the memory in OpenCL 
kernels. 
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For BLAS calls, the following steps are executed: 

1. For all involved cljnem memory objects, the underlying host pointer is re¬ 
covered and mapped to a new CPU pointer. At this point, the OpenCL 
backend ensures all compute kernels accessing the memory concurrently are 
done executing so that it is safe to access the memory over CPU pointers. 

2. The BLAS call is dispatched to a cBLAS library (Intel MKL, ATLAS or Open- 
BLAS). Here, the most optimized BLAS for the CPU device can be selected. 
Optimally, it should be a BLAS that is fully parallelized and uses all CPU 
cores. NUMA issues (see Section \6.7) might occur. 

3. As soon as the BLAS call returns, all CPU pointers are unmapped. This 
signals to the OpenCL backend that it is safe again to start OpenCL kernels 
on the cljnem objects involved. 

When mapping cljnem objects, it can be specified that the access is read-only. In 
this case, OpenCL kernels that also only use the object in read-only mode can 
continue to run during the BLAS call. 


Using the OpenCL backend on CPUs is a design decision. An alternative would 
be to parallelize the existing CPU backend with OpenMP pragmas. However, as 
most of the computational complexity resides with the BLAS calls (see Section 


5.51 and the OpenCL kernels are not using local memory extensively, they run 


very well also on CPUs. Only the BLAS, which is very device specific, and needs 
to be optimized for the memory architecture (see Section [63] l, needs to be different 
from the GPU version of the OpenCL backend. 


5.5 Convolution Methods 


Convolutions are usually computed using three different methods: 

1. GEMM (matrix multiplication) convolutions, requiring a reshape of the input 
to fit the BLAS SGEMM scheme. 

2. Direct convolutions, shifting the convolution kernel directly over the input. 

3. EFT domain convolutions, requiring to compute at least two Fourier trans¬ 
forms and one inverse Fourier. 


Gaffe implements GEMM convolutions. The advantage is that highly efficient 
BLAS libraries are available specifically for various devices, such as clBLAS, cuBLAS 
and OpenBLAS. It is very hard to implement convolutions more efficiently using 
direct convolution. The performance of such implementations is not portable for 
different kernel sizes and hardware types. In my own preliminary experiments, 
not even 10 % efficiency could be reached in the zpl layer of SK-Net, compared to 
up to 90 % using GEMM convolution, including the time for input reshaping (see 
Section [6.6.1 Table 6.6 and Figure [G^ . 


It is particularly difficult to get good local and global memory access patterns 
when programming kernels for direct convolution. BLAS libraries have already 
been optimized to use GPU local memory and GPU caches optimally. 
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GEMM convolutions also simplify the implementation of higher dimension and 
strided kernel convolutions, as only the code for reshaping the input has to be 
adapted. In Caffe, these functions are called imZcol and collim. 


A huge disadvantage with GEMM convolution is the memory requirement for the 
convolution buffer, discussed in Section 6.4 Assuming square sized kernels and 
output images. Equation 5.13 gives the buffer size in float elements. The kernel 
and output dimension is denoted by x. The network architectures in this report 
all use X = 2 (2D). 

Mbuffer =fin-k^-W^ = K-N (5.13) 


A matrix multiplication consists of three matrices, A G B G and 

C G R^^^. In Caffe, A is the weight matrix, B the column data after imicol and 
C = A ■ B the layer output. 

The dimensions are M = /out/ N = zv^ and K = The resulting computa¬ 

tional complexity is 0{M ■ N ■ {2- K — 1)) = 0{fout -zv^ ■ {2- /in ■ k^ — 1)). 


K — k^ • fin Ai = 2X7^ iV = 
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Eigure 5.5: Matrix-matrix multiplication in Caffe. 


In the BEAS libraries, row-major NN-SGEMM has to be used. Row-major means 
the leading dimension in memory is the matrix row. NN means that the matrices 
A and B are both not transposed. SGEMM stands for single precision general 
matrix matrix multiplication. 


In the scope of this project, EET convolutions have not been considered. There 
were no suitable EET libraries available for all devices that needed to be supported. 
Normal EET convolution also uses a lot of additional memory, as all kernels are 
stretched to the size of the input in Eourier space. The device memory is typically 
insufficient (see Section 6.21 to store all EET kernels for reuse, and recomputing 
every kernel is too time intensive. 


The cuDNN library also implements a modified form of GEMM convolutions, and 
they also evaluated EET and direct convolution as options p0| . A recent paper 
about using fast EET convolutions exists (Vasilache et al. pT|), but managing the 
device memory remains difficult. 
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Chapter 6 


Benchmarks 


6.1 Introduction 

This chapter assesses the performance (or efficiency) of different models across a 
variety of hardware devices. This is essential for speeding up neural networks by 
finding and eliminating computation and memory bottlenecks. 

All theoretical computations in this chapter assume two dimensional square sized 
compute kernels and feature maps. This is in accordance to how the network 
architectures are set up (see Chapter]^. 

6.2 Hardware 


Vendor 

AMD 

nVidia 

Intel 

Intel 

Device 

W9100 

GTX 980 

i7-4790K 

2x E5-2697v3 

Compute Units 
(Shaders) / (Threads) 

44 (2816) 

16 (2048) 

4(8) 

28 (56) 

Memory (GiB) 

16 

4 

16 

>16 

Clock Frequency (MHz) 

930 

1126 

4000 

2600 

Performance (GFLOP/s) 

5240.0 

4612.0 

512.0 

2329.6 

Memory Speed (GiB/s) 

320 

224 

25.6 

2 x 68 


Table 6.1: Hardware used for benchmarking. 


The device specifications in Table |6.1| are taken from the official white papers 
(AMD [ [2^ , nVidia Intel ARK ]2^). The FLOP/s performances indicated 
assume the fused-multiply-add (FMA) operation. 

The two AMD W9100 graphics cards have kindly been sponsored by AMD p5| , 
as noted in the acknowledgments. The card has special features such as high 
64 bit precision performance and error correcting memory. Those were not used 
for Caffe, only the OpenCL 2.0 driver and the large amount of memory was of 
importance. The W9100 is a workstation card, but consumer cards from both 
nVidia and AMD can also be used without restrictions as long as there is enough 
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6.3. Software 


device memory (R9 290X, R9 390X, Titan, Titan X). This is to be considered when 
building low-cost, high-throughput systems for neural networks. 

The i7-4790K CPU and GTX 980 GPU are devices of my personal workstation, 
specially acquired to test with up to date hardware (as of 2015). 

Extended testing with the E5-2697v3 processors and its issues (see Section |6.7| 
was not possible as this device was a workstation that I could only access briefly 
during my time at HHMI Janelia. 

Eor all benchmarks in this chapter, the W9100 GPUs have been used, because 
they were the only available GPUs capable of running all models in forward- and 
backward-mode on a wide range of output sizes, due to memory requirements. 
The only exceptions to this are the hardware comparison benchmarks and where 
indicated explicitly. 

Unless otherwise noted, Intel is used as an alias for the i7-4790K processor, nVidia 
for the GTX 980 GPU and AMD for the W9100 GPU in this chapter. 

6.3 Software 

The modified version of Gaffe |j^ in Project Greentea supports a variety of con¬ 
figurations that perform differently depending on the compute device. Table |6.2| 
represents the setup used for benchmarking. It is the best performing setup pos¬ 
sible for each combination of backend and device for the models in Ghapter 


Device 

Backend 

Memory Allocation 

BLAS 

Compute Kernels 

Intel 

GPU 

Host (native) 

OpenBLAS 

Gaffe (GPU native) 

Intel 

OpenGL 

Host (OpenGL) 

OpenBLAS 

Greentea (OpenGL) 

AMD 

OpenGL 

Device (OpenGL) 

clBLAS 

Greentea (OpenGL) 

nVidia 

OpenGL 

Device (OpenGL) 

clBLAS 

Greentea (OpenGL) 

nVidia 

GUDA 

Device (GUDA) 

cuBLAS 

Gaffe (GUDA) 


Table 6.2: Software configuration used for benchmarking. 


OpenBLAS is compiled to use all GPU cores through OpenMP and supports all 
vector extensions available on the GPUs used. Alternatively, the cBLAS header 
interface also supports Intel MKL and ATLAS as replacements for OpenBLAS. 

The GPU could also be used with clBLAS. This is not advisable, as clBLAS is op¬ 
timized for GPUs, which have a different memory architecture than GPUs. While 
there is a cache hierarchy on the GPU, the GPU needs to use fast local memory 
to buffer blocks of the matrix (from global device memory) temporarily. Local 
memory does not exist on the GPU, therefore it would result in copying data need¬ 
lessly on the host memory. This results in low efficiency, because GPUs already 
have a slow memory interface compared to GPUs (see Table [6^. 


As ViermaGL |19| was used as a part of the OpenGL backend, ViennaGL-BLAS 
is also available as an alternative to clBLAS. It is slower than clBLAS, but more 
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6.4. Device Memory 


convenient to use as it does not need to be compiled separately and only consists 
of C++ header files. 


6.4 Device Memory 


As all networks presented can be run with almost any size of output (up to re¬ 
strictions given by layers such as even divisibility of the input feature maps, see 
Chapter |^, the networks can be set up so that they fit the memory and computa¬ 
tional restrictions given by each device. The networks do not have to be re-trarned 
in this case and results are numerically identical. 

A second objective may be to use processing output sizes matching the dataset: 
Non-square outputs such as 256 by 32 pixels are also possible. Then, the sizes 
can be set so that the image sizes to process are divisible by the output size of the 
network. Like that, no computations are wasted. 

It is also important to note that there is no easy scaling rule as to how the memory 
requirements will change with different output sizes as this depends on the num¬ 
ber of feature maps and their sizes on all layers as well as the maximum convolu¬ 
tion buffer size (Mbuffer)- 


An upper bound estimation for re » r; is that the memory usage increases propor¬ 
tional to w^, where w is the output size and v the total padding size as in Table 

1631 

It follows that the network gets more efficient with a bigger ratio removing 
more overlapping computations. Using re = 1 on the SK network for example 
results in having the same efficiency as a minibatch sliding window (SW) network, 
which is about 50 times slower (see Table |6.5| than the corresponding SK network 
with 128 by 128 output. 

On 3D networks, the memory allocation would scale in the order of re^, limiting 
the output patch-cube size very quickly. 


The smallest memory available in the test hardware was 4GiB (see Table 6.11, 
thus the networks have been set up as in Table 6.3 for forward processing. The 
configurations are the same as for training and as described in Chapter except 
for the USK net, where half the size was used (256 instead of 512). The resulting 
memory requirements are visible in Figure |6.1| and Table 6.4 


Training the USK model with 512 x 512 pixels, which improves training with Malis 
loss, requires up to 8 GiB of device memory, which is close to what scaling propor¬ 
tional to iv^ predicts. 


Network 

SK 

USK 

U 

Processing output size iw x w) 

128 X 128 

256 X 256 

388 X 388 

Training output size iw x w) 

128 X 128 

512 X 512 

388 X 388 

Total padding size (z?) 

102 

180 

184 


Table 6.3: Output and padding configurations in pixels. Output sizes are mostly 
flexible, but the padding size is a fixed network characteristic. 
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6.4. Device Memory 


Network 

SK 

USK 

U 

Data Blobs (MiB) 

219 

560 

1134 

Processing (MiB) 

2759 

1621 

1710 

Training (MiB) 

3056 

2204 

2955 


Table 6.4: Peak device memory usage when using the processing output sizes 
denoted in Table 6.3 for both processing and training. 



Figure 6.1: Peak device memory usage when using the processing output sizes 
denoted in Table 6.3 for both processing and training. The comparison reveals 
that SK networks would use a lot more memory than USK and U with the same 
output size, while USK and U are roughly comparable. 


The memory usage during training is always higher than the processing require¬ 
ments. This is because the error / difference is stored during back propagation. 
The data blobs therefore have to be stored twice (data and difference) for all layers 
that are back propagated - making the difference between training and processing 
roughly the size of the data blobs, as can be seen in Figure |6.1[ 

With SK and USK models, data, difference and weights are the smallest contribut¬ 
ors to memory usage. The biggest consumption comes from the temporary buffer 
needed to compute convolutions with matrix multiplications. The effect is much 
smaller with U-Net architectures. 

To assess this, the size of the buffer can be estimated: 

Mbuffer = max (6.1) 

;(')eUonv. 

The size of Mbuffer is in floating point elements. To get an estimate in bytes, it has 
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6.4. Device Memory 


to be multiplied by the GPUs floating point precision size, which is typically 4 
bytes. 


The maximum buffer allocation that worked with OpenCL and CUDA was 4 GiB 
for a single memory block. This is the upper limit for Mbuffer- With the most 
expensive convolution in the layer of SK-Net (see Section 6 .6. 1[ ), it is possible 
to compute the limits for the network output size w, according to Equation |6.1| 

/in = 192, /out = 1024 and k = 10 are fixed network parameters. Thus, the 

236 pixels. 


maximum is re = 


For U-Net, this buffer is not the dominant factor that limits the network. In the 
USK-Net design, the ipl layer is of less importance and has less input feature maps 
and a smaller kernel size. This reduces the Mbuffer limitation and allows bigger 
output patches, making the network more efficient. 

With the OpenGL backend (Section [SAj l, the memory overhead is up to min(n, q) ■ 
Mbuffer/ where q is the number of parallel work queues and n the minibatch size. 
This helps to speed up the many small convolutions that occur in SW networks 
by starting up to q convolutions in parallel. This feature can not be used with 
the GUDA and GPU backend. For GUDA, the solution in this case is to use 
cuDNN, which streams the convolutions in batches to be more efficient 1^ . Both 
parallel queues and cuDNN are not of importance here, because the SK, U and 
USK architectures all work efficiently with having n = 1. 


Reusing the convolution buffer is also a new feature in the improved Gaffe library 
1^. With the original Gaffe library ||^, the memory overhead would have been 
much higher, allocating one buffer per convolution layer: 


Mbuffer = E /in (6-2) 

6')6Lconv. 


Those buffers are in both cases persistent, because freeing them and re-allocating 
would cause too much run time overhead. With re-using the buffer, it would 
also not decrease the peak allocation any further. Alternative ways to implement 
convolutions are discussed in Section 15.51 

The memory consumption can not be decreased further during training, because 
the data blobs have to persist during forward- and backward-computation to calcu¬ 
late the difference and update the weights in training. However, during processing, 
a lower bound estimate is given by: 


Mtotai = min(n, q) ■ Mbuffer + n ■ max (re^' + /iut(«^^'^)^] (6-3) 

This assumes that the blobs B of a network are not persistent and at most the input 
and output of the most memory consuming layer has to be stored. A network with 
its blobs can be seen as a directed acyclic graph (DAG). Therefore, the estimate is 
higher if there are layers / blobs that exist in parallel with each other. 

Gurrently (as of August 2015) there is no Gaffe implementation that re-uses the 
blobs in this way. It would be necessary to analyze the DAG when instantiating 
the network first. Then, blobs would have to be allocated and assigned so that no 
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6.5. Labeling Throughput 


conflicts exist. The DAG would need to be split up for analysis in the case that 
there are multiple devices with independent memory working on different parts 
of the network. 

The implication is that lower end devices such as GPUs with less memory and 
even mobile devices would be capable of classifying with larger networks such as 
those presented in this report. 

I chose to not implement this because there was enough GPU memory available 
and Figure |6.1| indicates the reduction would not be very large, as most memory 
is consumed by matrix-matrix multiplication buffer (Mbuffer)- 

6.5 Labeling Throughput 


Device 

Backend 

SW 

SK 

USK 

U 

Intel 

GPU 

73 

(P” 


(jlST 

Intel 

OpenGL 

98 

3 504 

19414 

32 692 

nVidia 

OpenGL 

850 

28461 

267478 

390156 

AMD 

OpenGL 

1108 

59 513 

354865 

483 766 

nVidia 

GUDA 

1488 

85460 

658018 

1058125 


Table 6.5: Labeling throughput 
not implemented 


The labeling throughput (Table |6.5| is an overall performance measure for neural 
networks. It also shows how different devices and backends perform. Even on the 
GPU, using the fastest network (U) gives a speedup of 447 x compared to what 
was achievable with Gaffe |j^ prior to Project Greentea |j^. When using a network 
(SK) that gives identical results as the original SW network, the GPU speedup is 
still a factor of 48 x. 


On AMD GPUs, speedups of 54 x (SK-Net), 437 x (U-Net) and 320 x (USK-Net) 
compared to SW-Net are possible. The nVidia GPU scales similarly on both the 
OpenGL (SK: 33x, U: 459x, USK: 315x) and GUDA (SK: 57x, U: 711x, USK: 
442 x) backend. 


SK, USK and U networks can not be executed directly on the legacy GPU backend, 
as layers such as strided kernel layers and merge crop have no native GPU kernels 
implemented. They are only available on GUDA and OpenGL. The fact that the 
OpenGL backend on GPUs is better parallelized (see Section 5.4.2| than native 
GPU execution in Gaffe and that the speedups between networks are similar on 
GPUs and GPUs justifies not implementing the GPU kernels. 

Figures [6^ and [63] represent the values of Table 6.5 in both logarithmic and linear 
scale. 


While the GUDA backend is usually the fastest, the nVidia GPU performs slower, 
as expected by the FLOP values in Table 6.1 relative to the AMD GPU when 
using OpenGL on both of them. Explaining this behavior needs insight into the 


45 



















6.5. Labeling Throughput 


individual network layer performance (Section |6.6| and the convolution operations 
(Section |5.5| |. 


(V 

03 


q; 

X 

'q. 


ZJ 

Q. 

U) 

ZJ 

o 


q; 

03 



CNN Architecture 


Figure 6.2: Log scaled labeling throughput 



Figure 6.3: Linear scaled labeling throughput 


The network performance during training was not assessed in this report, as train¬ 
ing the models on the available data was not a limiting objective. Back propagation 
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6.6. Layer Performance Analysis 


is usually slower than forwarding, as the differential maps, gradients and weight 
updates have to be computed. 

6.6 Layer Performance Analysis 

This section takes apart the neural networks down to individual layers to assess 
why certain networks are faster than others and find potential to optimize models. 


IZZI 

Data 

IZZI 

Convolution 

IZZI 

Pooling 


Upconvolution 

IZZI 

Activation 

IZZI 

Other 


Figure 6.4: Layer-wise analysis legend. 


6.6.1 SK-Net 


16: prob 
0.15 ms, 0.06% 
15: ip3 
1.57 ms, 0.56% 
14: relu5 
0.56 ms, 0.20% 
13: ip2 
10.02 ms, 3.61% 
12: relu4 
0.97 ms, 0.35% 


11: ipl 
230.56 ms, 83.01% 



0: data 
0.27 ms, 0.10% 
1: silence 
0.02 ms, 0.01% 
2: convl 
3.49 ms, 1.25% 
3: relul 
0.34 ms, 0.12% 
4: pooll 
1.00 ms, 0.36% 
5: ccinv2 
14.64 ms, 5.27% 
6: relu2 
0.49 ms, 0.18% 
7: pool2 
1.56 ms, 0.56% 
8: ccinv3 
9.63 ms, 3.47% 
9: relu3 
0.53 ms, 0.19% 
10: pool! 

1.96 ms, 0.71% 


Figure 6.5: Layer-wise timings of SK-Net. 


Looking at Figure |6.5} the convolution layers and especially the ipl layer (83 %) are 
responsible for the overall performance. There are many input (192) and output 
(1024) feature maps (see Section [33| and ther efor e a high computational complex¬ 
ity (©(/out • • (2 • /in • - 1))) (see Sectionon the ipl layer. 

As the convolution layers only consist of a fast memory copy operation to ar¬ 
range the data, so that matrix-matrix multiplications through an optimized BLAS 
become possible, and the SGEMM (single precision general matrix-matrix multi- 
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6.6. Layer Performance Analysis 


plication) call itself, the efficiency values in Table 6.6 and Figure 6.6 are direct 
proxies for how efficient the BLAS works on the devices. 


Layers that operate with complexity 0(/in ■ which includes all other layers 
used in the SK, USK and U networks, only contribute a small fraction to the total 
forwarding time (less than 1 % per layer). 


Layer 

GFLOP 

AMD OCL 

nV OCL 

nV CUDA 

Intel OCL 

convl 

0.70 

3.83% 

8.79% 

13.84% 

6.16% 

conv2 

14.06 

18.33% 

27.13% 

56.45% 

12.89 % 

conv3 

18.40 

36.46% 

22.05 % 

21.78% 

18.14% 

ipl 

644.23 

53.32% 

27.15% 

90.50% 

34.83% 

ip2 

17.17 

32.72% 

23.83% 

62.15% 

22.35 % 

ip3 

0.03 

0.41 % 

0.75% 

0.75% 

0.12% 


Table 6.6: FLOP efficiency for the convolution layers in the SK-Net. The values 
are relative to Table 6.1 FLOP performances. 


-• GFLOP 


□ □ AMD OpenCL • nVidia CUDA 

A A nVidia OpenCL o o Intel OpenCL 



Layer 


Figure 6.6: FLOP efficiency of the convolution layers in the SK-Net. 


Even though ipl is the most expensive layer, it also is the most efficient (see Figure 


6.61 on all devices, when using the BLAS which is most optimized. This means 
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6.6. Layer Performance Analysis 


clBLAS for AMD, cuBLAS for nVidia and OpenBLAS for Intel. 

It is important to note that clBLAS is not yet as optimized as cuBLAS for this type 


of matrix-matrix multiplications (see Section 5.51. Both GPUs currently perform 
worse than expected with the OpenCL backend. As both clBLAS and the Project 
Greentea are still quite recent projects, the performance is expected to increase with 
further optimizations in the future. 

Other convolution layers are over 30 times less expensive and down to half as 
efficient. With small convolutions, the memory copy and kernel launch overhead 
lower the efficiency. Very small convolutions such as ip3 and convl may also not 
fully utilize the many threads available on GPUs (see Table 6.11. As the ineffi¬ 
cient layers have short computation times here, they are not very important in 
optimization. 


Only if the network mostly consists of such inefficient layers (which is not the 
case with SK-Net, but does apply to SW-Net), the number of feature maps and 
convolution sizes should be increased, if it improves the segmentation. Using 
minibatches (n > 1) can increase the GPU utilization on OpenGL when using 
multiple queues {q > 1) and the efficiency on GUDA when using cuDNN pO). 


6.6.2 U-Net 


56: conv22 
10.87 ms, 3.48% 

54: conv21 
20.75 ms, 6.64% 

53: mergecrop4 
4.07 ms, 1.30% 

52: conv20 
2.89 ms, 0.93% 

51: upconv4 
9.78 ms, 3.13% 

49: convl9 
9.22 ms, 2.95% 

47: convl8 
18.14 ms, 5.80% 

46: mergecrop3 
2.20 ms, 0.71% 

45: convl7 
2.73 ms, 0.87% 

44: upconv3^^ 
7.33 ms, 2.34% 

42: convl6 
8.39 ms, 2.68% 

40: convl5 
17.01 ms, 5.44% 

38: convl4 
3.08 ms, 0.98% 

37: upconv2 
9.45 ms, 3.02% 

35: convl3 
12.10 ms, 3.87% 

33: convl2 
25.84 ms, 8.27% 



2: convl 
7.25 ms, 2.32% 
4: conv2 
21.68 ms, 6.94% 
8: conv3 
10.00 ms, 3.20% 
10: conv4 

16.47 ms, 5.27% 
14: conv5 
7.57 ms, 2.42% 

16: conv6 
12.99 ms, 4.15% 

20: conv7 
8.06 ms, 2.58% 

22: conv8 
14.27 ms, 4.56% 

26: conv9 
6.91 ms, 2.21% 
28: convlO 

11.47 ms, 3.67% 
30: upconvl 
16.21 ms, 5.19% 
31: convll 
3.65 ms, 1.17% 


Figure 6.7: Layer-wise timings of U-Net, excluding layers that contribute less 
than 0.5 % of the total forwarding time. 
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6.6. Layer Performance Analysis 


Layer 

GFLOP 

AMD OCL 

nV OCL 

nV CUDA 

Intel OCL 

convl 

1.10 

2.90% 

6.58 % 

7.69 % 

5.50% 

conv2 

23.77 

20.92 % 

19.54% 

43.92 % 

7.60% 

conv3 

11.72 

22.35 % 

26.10 % 

53.21 % 

19.15% 

conv4 

23.11 

26.78 % 

22.90 % 

58.79 % 

13.35% 

conv5 

11.23 

28.31 % 

29.78 % 

63.92 % 

23.50% 

conv6 

21.81 

32.06% 

24.70 % 

73.71 % 

24.66 % 

conv7 

10.27 

24.32 % 

32.31 % 

77.68 % 

40.24% 

convS 

19.33 

25.85% 

24.91 % 

88.07% 

42.55% 

conv9 

8.49 

23.44% 

32.93 % 

76.82 % 

54.12% 

convlO 

14.80 

24.62 % 

25.05 % 

84.03 % 

54.30% 

convl1 

3.29 

17.17% 

26.57% 

78.12% 

38.53% 

convl2 

27.52 

20.33% 

32.39 % 

67.04 % 

40.50% 

convl3 

12.76 

20.12% 

23.86 % 

75.78 % 

43.25 % 

convl4 

2.83 

17.56% 

24.03 % 

71.13% 

28.54% 

convl5 

24.54 

27.54% 

29.25 % 

71.21 % 

22.74% 

convl6 

11.79 

26.83% 

24.10 % 

75.04 % 

25.90% 

convl7 

2.62 

18.29 % 

21.74% 

61.10% 

23.22 % 

convlS 

23.12 

24.32 % 

27.82 % 

62.06 % 

16.63% 

convl9 

11.32 

23.43 % 

21.69% 

59.62 % 

13.79 % 

conv20 

2.51 

16.54% 

19.09 % 

39.84% 

17.82 % 

conv21 

22.42 

20.62 % 

23.77% 

45.12% 

10.44% 

conv22 

11.09 

19.47% 

18.84 % 

43.50 % 

8.09% 

ipl 

0.04 

0.79% 

1.84% 

2.40 % 

0.35% 


Table 6.7: FLOP efficiency for the convolution layers in U-Net. The values are 
relative to Table 6.1 FLOP performances. 


In the U-Net, most convolutions have the same order of magnitude in complexity 
(see Table |6.7| . The result is a very balanced network in terms of forward tim¬ 
ings (see Figure [6y) |. The balancing comes from trading feature map size against 
feature map count towards the middle of the network. 

The upconvolution layers contribute each up to 5.2 % of the network forward¬ 
ing time. This is less efficient than the optimum as only 4-nearest neighbor in¬ 
terpolation with constant weights is computed. This would not be more effort 
than a memory copy operation during forwarding and accumulating of the four 
nearest neighbor values during backward computation. Currently, it is implemen¬ 
ted using a Caffe deconvolution, which is a reversed convolution layer. This layer 
already existed in Caffe, while no direct upsampling of feature maps is implemen¬ 
ted. The convolution kernels are grouped, meaning each upsampling kernel only 
considers a single feature map. One advantage is that the deconvolution layer also 
allows adaptive weights and other interpolations such as bilinear upsampling. 
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6.6. Layer Performance Analysis 


-• GFLOP 


□ □ AMD OpenCL • nVidia CUDA 

A A nVidia OpenCL o o Intel OpenCL 



Figure 6.8: FLOP efficiency of the convolution layers in U-Net. 


6.6.3 USK-Net 


28: ip3 
0.60 ms, 0.32% 
27: relu9 
0.23 ms, 0.12% 
26: convS 
10.00 ms, 5.39% 
25: reluS 
0.42 ms, 0.23% 
24: conv7 
20.57 ms, 11.10% 
23: mergecropl 
3.26 ms, 1.76% 
22: conv6 

3.58 ms, 1.93%' 
21: upconvl 

11.06 ms, 5.96% 
20: relu7 
0.26 ms, 0.14% 
19: ip2, 

3.59 ms, 1.94% 
18: relu6 

0.54 ms, 0.29% 


17: ipl 
58.90 ms, 31.77% 



2: convl 
4.87 ms, 2.62% 
3: relul 
0.63 ms, 0.34% 
4: conv2 
12.18 ms, 6.57% 
5: relu2 
0.59 ms, 0.32% 
7: pooll 
0.75 ms, 0.40% 
8: conv3 
18.51 ms, 9.99% 
9: relu3 
0.38 ms, 0.21% 
10: pool2 
1.32 ms, 0.71% 
11: conv4 
15.42 ms, 8.32% 
12: relu4 
0.39 ms, 0.21% 
13: pool3 
1.21 ms, 0.65% 
14: conv5 
14.49 ms, 7.81% 
15: relu5 
0.39 ms, 0.21% 
16: pool4 
1.08 ms, 0.59% 


Figure 6.9: Layer-wise timings of USK-Net, excluding layers that contribute less 
than 0.1 % of the total forwarding time. 
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6.6. Layer Performance Analysis 


Layer 

GFLOP 

AMD OCL 

nV OCL 

nV CUDA 

Intel OCL 

convl 

0.64 

2.51 % 

4.97% 

7.08% 

5.43% 

conv2 

13.75 

21.55% 

19.25 % 

42.11% 

8.04% 

conv3 

26.25 

27.06 % 

20.14% 

62.58% 

11.73% 

conv4 

21.81 

26.99% 

21.78% 

62.61 % 

12.57% 

conv5 

18.92 

24.93 % 

27.45 % 

63.93% 

13.06% 

ipl 

141.76 

45.93% 

31.25% 

86.64% 

27.82 % 

ip2 

4.43 

23.53% 

28.14% 

72.69 % 

42.59 % 

conv6 

4.42 

23.56% 

21.29% 

63.39% 

31.44% 

conv7 

29.44 

27.31 % 

27.88 % 

61.29% 

17.26% 

conv8 

9.66 

18.44% 

20.45 % 

46.17% 

7.34% 

ip3 

0.02 

0.53% 

1.73% 

1.73% 

0.36% 


Table 6.8: FLOP efficiency for the convolution layers in USK-Net. The values are 
relative to Table 


6.1 


FLOP performances. 


-• GFLOP 


□ □ AMD OpenCL •—* nVidia CUDA 
A A nVidia OpenCL o o Intel OpenCL 



Figure 6.10: FLOP efficiency of the convolution layers in USK-Net. 


In the USK network, the ipl layer is again the most expensive and efficient one, 
but the network is more balanced as ipl only contributes 32 % of the forwarding 
time, compared to 83 % in the original SK-Net. Efficiency and forwarding times 
are, as expected, a mixture of the SK- and U-Net. 
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6.7. NUMA Issues 


Except for ip3 and convl, all layers have a relatively high efficiency. The inefficient 
layers go from /in = 64 to /out = 2 and /in = 3 to /out = 64 respectively. This 
will directly result in matrix multiplications with matrices that have strongly non¬ 
square shapes and are therefore less efficient (see Section [53|. 


6.7 NUMA Issues 


An issue that came up testing the OpenCL hybrid backend (see Section 5.4.2| was 
that the performance did not scale as expected with systems that have more than 
one CPU. Such systems have non-unified memory access (NUMA) because the 
CPUs share one address space for memory, but every processor has its own cache 
and memory interface. Accessing data across the other CPU comes with a large 
performance penalty. Compute kernels, such as the matrix-matrix multiplication 
in the BLAS library or the custom OpenCL kernels, cause the threads to work on 
adjacent data. This means a write operation of one CPU is likely to invalidate 
cache lines across both CPUs. At this point, the s}mchronization overhead seems 
to become larger than any speedup of having additional cores working on the 
algorithms. 


Layer 

i7-4790K 

2x E5-2697v3 

Speedup 

convl 

22.19 ms 

47.02 ms 

0.47 

relul 

2.68 ms 

5.08 ms 

0.53 

pooll 

14.65 ms 

17.17ms 

0.85 

conv2 

213.16 ms 

176.96 ms 

1.20 

relu2 

4.94 ms 

14.04 ms 

0.35 

pool2 

40.06 ms 

44.40 ms 

0.90 

conv3 

198.15 ms 

238.91 ms 

0.83 

relu3 

5.85 ms 

17.45 ms 

0.34 

pool3 

57.08 ms 

59.05 ms 

0.97 

ipl 

3612.29 ms 

4168.21 ms 

0.87 

relu4 

13.59 ms 

47.62 ms 

0.29 

ip2 

150.03 ms 

187.05 ms 

0.80 

relu5 

5.90 ms 

24.10 ms 

0.24 

ip3 

56.25 ms 

77.51 ms 

0.73 

prob 

0.48 ms 

4.88 ms 

0.10 


Table 6.9: 4 core i7-4790K versus 2 CPU 28 core (NUMA) E5-2697v3, comparing 
SK network layer wise forward timings. 


Each CPU has 14 cores, which gives 28 cores for the whole system. The number of 
threads for the frontend was set to 14, which gave the best performance by keep¬ 
ing at least the BLAS library temporarily tied to one processor by the operating 
system's scheduler. The OpenCL backend, which also allocates the memory, used 
56 threads and allocated memory on both interfaces. 

Table 6.9 shows the impact of having NUMA issues, taking the SK network as 
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6.7. NUMA Issues 


an example. The layers are all sufficiently parallelized, which is evident when 
looking at Figure |6.6[ Differences of the CPU architectures are also not a possible 
explanation as both processors are based on the same instruction sets and gen¬ 
eration. The memory interface should also be fast enough to keep up with the 
computations (see Table 6.11. 


Correcting for processor frequency, the speedup should be up to a factor 4.55 x 
using all 28 cores (ReLU and pooling layers) and up to 2.275 x using only one pro¬ 
cessor (inner product and convolution layers). The effective speedups measured 
are much slower: 


• Speedups between 0.1 x and 0.97x on element-wise layers with complex¬ 
ity up to 0{fi-n ■ zo^). Cache invalidation between the two processors (28 
cores) seems to be dominant. The element-wise kernels run on the OpenCL 
backend. 

• Speedups between 0.47x and 1.20x on convolution layers with complexity 

up to 0{fout • ■ (2 • /in ■ — 1)) (matrix-matrix multiplication). The effects 

of OpenBLAS running on 14 cores and sub optimal memory allocations are 
dominant. It is a proxy for how the matrix-matrix multiplications used in all 
convolutions perform. 

To get the expected speedup, the two processors need to be presented to the Caffe 
library as two separate devices. Then the library can be used in two individual 
instances. As the OpenCL hybrid backend uses two separate parallelization mech¬ 
anisms (OpenCL kernels and a parallelized BLAS), two solutions would need to 
be applied: 

• The Caffe frontend needs to be tied to the cores of one CPU, so that the BLAS 
library does not show NUMA issues. 

• The OpenCL backend needs to split up the processor setup into sub-devices 
using device fission. The splitting rule needs to be that all cores belonging 
to one processor (tested by cache affinity) are tied to the same sub-device. 
Only one is then used per Caffe instance. Device fission is an extension to 
OpenCL that is already available {cl^xt^ssion p^). 

• The cores used in the frontend and selected sub-device need to be the same. 


Due to not having permanent access to a system with two processors and OpenCL 
installed, I did not have time to test out the solutions. Implementing the solutions 
remains as an open issue at the time of the project. 
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6.8. Alexnet 


6.8 Alexnet 

For comparison how the backends and devices perform on a widely used network 
for image classification that uses minibatches (with n = 10, w = 227, = 3) and 

multiple OpenCL queues (q = 8), the Alexnet Q included in the Caffe library was 
also evaluated. 


Device 

Backend 

Forward 

Backward 

Total 

Forward Speedup 

Intel 

GPU 

452.807 ms 

355.553 ms 

808.420 ms 

1.00 

Intel 

OpenGL 

238.188 ms 

152.811 ms 

391.160 ms 

1.90 

AMD 

OpenGL 

52.348 ms 

128.009 ms 

180.420 ms 

8.65 

nVidia 

OpenGL 

26.055 ms 

47.446 ms 

73.540 ms 

17.37 

nVidia 

GUDA 

20.899 ms 

16.730 ms 

37.718 ms 

21.66 


Table 6.10: Alexnet timings, average forward-backward pass over 50 iterations. 


The CUDA backend has an advantage over the OpenCL backend in terms of speed, 
but is less versatile. The AMD GPU seems to be less efficient with minibatches 
and smaller matrix-matrix multiplications than the nVidia GPU, which is why the 
AMD GPU performs worse than the nVidia GPU on the same backend. With the 
SK-, USK- and U-Net (Section [63] l, the AMD GPU performs better using the same 
(OpenGL) backend on both GPUs. 

Especially the backward computation is much slower (by a factor of three) using 
OpenGL instead of GUDA. The algorithms used are the same, therefore this differ¬ 
ence is difficult to explain. Possibly, different optimizations need to be applied in 
the backward step of convolution layers, which can have a sequential bottleneck 
by adding up the gradients over the minibatch. 

Using the OpenGL hybrid backend on the Intel GPU outperforms the (legacy) GPU 
backend by almost a factor of two. The speedup comes from parallelization of 
Greentea's OpenGL compute kernels, which are only single-threaded in the Gaffe 
GPU backend. The BLAS library used for convolutions is multithreaded in both 
cases. 
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Chapter 7 


Results 


7.1 Introduction 

In this chapter, the results of training the models from Chapter]^ are presented. 
The evaluation is based on: 

• Two data sets (DSl and DS2: see Chapter]^. 

• Three models (SK, U, USK: see Chapter]^. 

• Two training loss functions (Softmax and Malis). 

• One processing loss function (Softmax). 

• Three training configurations (Softmax, Malis and Softmax + Malis). 

• lO'OOO training iterations per configuration, respectively 20'000 for combined 
training. 

• Three error objectives (rand, warping and pixel error). 

The amount of training iterations was chosen so that training of all 18 combina¬ 
tions was feasible during one week on two AMD W9100 GPUs, providing a total 
of lOTFLOP/s 

The training data is not very large in both cases (see Chapter]^ and thus the loss al¬ 
ways converged in under lO'OOO iterations for each training method. It is possible 
that some trainings did overfit as no early stopping was applied. Technically, the 
networks did not get to see the same amount of examples during training even 
though the iterations are the same, as the chosen output sizes of each network 
were set differently. However, due to gradient accumulation, different learning 
rates and weight initialization it is hard to estimate the effect of the amount of 
labels seen. As all networks converged to a stable loss, it should negligible. 

7.2 Analysis on DSl 

7.2.1 Training 

The training parameters used on DSl were set to not use the error masking func¬ 
tionality. Masking usually gives thicker membrane (background) labels when 
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7.2. Analysis on DSl 


used with Softmax by balancing out the amount of background error against fore¬ 
ground (cell interior) error. 


Malis loss was run without using a patch prior for preferring training patches 
based on their label histogram. Softmax loss on the other hand was used with the 
patch prior enabled. This is justified by the different characters of the loss func¬ 
tions: Softmax computes a per-pixel error while Malis gives errors at problematic 
pixels only, which can be very concentrated on a few pixels (see Section 5.3.2|. 


In the patch pre-processing step, the images were enhanced with CLAHE (con¬ 
trast limited adaptive histogram equalization) and normalized in the range of 
[- 1 . 0 , 1 . 0 ]. 


To get more training data, the images were blurred with a randomly chosen 5 by 
5 Gaussian kernel. The training patches were also rotated to multiples of 90° and 
mirrored randomly (horizontal and vertical). 

Details of the pre-processing and label priors are described in Chapter The 
exact training parameters are stored as prototxt configuration in the Caffe Neural 
Models repository 0. 

Interestingly, it was not possible to start training of the SK network with Malis loss 
directly. The loss did not converge, and the output feature maps drifted to being 
classified as only foreground or only background. Therefore, the weights of the 
network in Malis onZy-training were initialized using 4000 iterations of Softmax 
training first. This was the lowest number of iterations where the training con¬ 
verged to a small loss afterwards. The other network architectures did not show 
such a behavior and trained well when starting with Malis directly. 

This is related to weight initialization as well as the fact that Malis works better on 
bigger training patch sizes. Training of SK was limited to 128 by 128 pixels output, 
which is much less context for Malis to work on than with USK and U with 512 
by 512 and 388 by 388 pixels respectively. 


7.2.2 Numerical 


Rank 

Network 

Loss Function 

Rand 

Warping 

Pixel 

1. 

USK 

Softmax + Malis 

0.031251499 

0.000423193 

0.040650392 

2. 

SK 

Softmax + Malis 

0.043871103 

0.000324726 

0.052724789 

3. 

USK 

Malis 

0.045955948 

0.000578165 

0.058407081 

4. 

U 

Softmax + Malis 

0.050799468 

0.000487328 

0.044159558 

5. 

SK 

Malis 

0.063852001 

0.000386953 

0.055678171 

6. 

USK 

Softmax 

0.067068677 

0.000418186 

0.030634890 

7. 

U 

Malis 

0.078736573 

0.000520945 

0.059127919 

8. 

u 

Softmax 

0.091736036 

0.000595331 

0.033451667 

9. 

SK 

Softmax 

0.123334489 

0.000549316 

0.034416625 


Table 7.1: DSl error evaluation (lower is better). 
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7.2. Analysis on DSl 


Explanations for the Tables 7.1 and 7.2 


Rank: The internal ranking, as indicated by the rand error. 

Loss Function: Softmax indicates lO'OOO iterations with the Softmax loss 
function. Malis indicates lO'OOO iterations with the Malis loss function. When 
both are indicated, the training was executed with lO'OOO Softmax and then 
lO'OOO Malis training iterations. 


Rand, warping and pixel: Error metrics, as proposed by Jain et al. p7) and 
used in the ISBI 2012 challenge 0 A script for evaluation is available for 
Fiji 1131 in the Caffe Neural Models repository 0. 


The ranking on data set DSl is as expected: Taking the average rank, USK-Net 
performs better than SK-Net, which is more precise than U-Net. 

The same goes for training methods: Using Softmax + Malis minimizes the rand 
error better than using Malis only, and Softmax outperforms Malis on pixel preci¬ 
sion. On one hand, this is because the Malis criterion improves the rand error by 
penalizing merge and split errors. On the other hand, training with Malis will also 
decrease the pixel accuracy, which can be seen when inspecting the result visually. 
As Softmax training initializes the networks better than Malis, the best training 
method is to start with Softmax and then transit to Malis for fine tuning. 
Obviously, the USK network architecture performed best on both pixel accuracy 
and rand error. Only for the warping error, the much slower SK-Net (see Section 


6.51 performs best. 


7.2.3 Visual 

The visual analysis is based on the image number 2 of the DSl stack (see Section 


2.11. It includes a glia cell (Figure 7.1 label A) which is considered as background. 
The thickness of it makes it harder to label correctly, especially with Malis loss. 



Figure 7.1: DSl ssTEM raw and corresponding ground truth, 1024 by 1024 pixels 

(Source: ssTEM 
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7.2. Analysis on DSl 



(c) USK network, Softmax. 




(d) SK network, Malis. 



(f) USK network, Malis. 


Figure 7.2: Visual comparison of all results in Table 


7.1 
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7.2. Analysis on DSl 



(a) SK network, Softmax + Malis. 

\v 


(b) U network, Softmax + Malis. 



(c) USK network, Softmax + Malis. 


Figure 7.2: Visual comparison of all results in Table 7.1 


Explanation to the labels in Figure |7.2[ 

(A) Glia cell that should be considered background. Malis loss does focus on 
separating foreground objects, so this part gets only labeled correctly in the 
trainings that use Softmax only (a, b and c). 

(B) Diffuse membrane with light texture, which is hard to separate from cell 
interior. The separation is correct only in SK + Malis (d), while all other 
configurations lead to cormected cells, although all trainings with Malis (d 
to i) show uncertainty, which could be sufficient to segment correctly. Diffuse 
parts are always on the edge of being pushed to foreground or background, 
which makes the segmentation fragile. 

(C) Membrane with close proximity to a mitochondrion. Except for U networks 
that have been trained with Softmax (b and h), this gets labeled correctly. 
The downsampling of U-Net is sub-optimal here when pixel error (without 
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7.3. Analysis on DS2 


scaling) instead of foreground separation is the training objective. 

(D) Same case as label C, but here the mitochondrion is labeled with high cer¬ 
tainty by all networks, leading to mislabeling of the nearby membrane. 

(E) Removing an isolated mitochondrion with high certainty. The USK network 
(c and i) performed best on this task. Malis trainings are usually worse, 
again because Malis does not focus on pixel accuracy. 

(F) Mitochondrion with long, thin structure. It consequently leads to a wrong 
classification as membrane because of its shape. This is not an issue, as 
isolated misclassifications within a cell can be rejected on a higher level when 
reconstructing the connectome. 

(G) Mitochondrion close to the image border. As the training and classification 
uses border mirroring to make up for the missing context, it can lead to more 
errors near the border, independent of the network architecture. 

(H) Same case as label G. Softmax + Malis on USK-Net performed best on re¬ 
moving the mitochondria on E, F, G and H. 

(I) Diffuse membrane with a dark texture. It is expected that this gets com¬ 
pletely classified as membrane. However, this was not the case on all net¬ 
works and trainings. The worst case applies on U-Net (b and h), where the 
cells are almost connected. 


As expected, membranes get labeled thinner with Malis as this error criterion 
is stopping to provide loss at membranes as soon as two cells are sufficiently 
separated (see Section 5.3.21. Even though the separation border is thin, training 
with Malis separates cells very well and scores best on both DSl (Table 7.11 and 
DS2 (Table [7l|. 

The visual results match the numerical evaluation. USK performs best on pixel 
error with Softmax and best on rand error with Softmax + Malis. However, when 
using Softmax, there is not a huge difference in pixel error and visual results 
between the networks. Malis does not work very well on SK networks (d), leaving 
a lot of uncertainty in the prediction. It can still be a good segmentation, as 
thresholding of gray scale values can be applied. The numeric evaluation [13| 
does test different thresholds, which is why the score is rather good (see Table |7.1| 
despite the uncertainty. 


7.3 Analysis on DS2 

7.3.1 Training 

Training on DS2 was similar to training on DSl. The main difference is that the 
patch prior was not used on both Softmax and Malis. Instead, the error masking 
was enabled when using Softmax, which gives thicker borders. This is motivated 
by having input images which are slightly more blurred and thus the cell mem¬ 
branes are less sharp and harder to distinguish from cell interior. 

Here, the SK network already converged using 2000 training iterations of Softmax 
before switching over to Malis. Starting with Malis directly was also not possible. 
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7.3. Analysis on DS2 


7.3.2 Numerical 


Rank 

Network 

Loss Function 

Rand 

Warping 

Pixel 

1. 

SK 

Softmax + Malis 

0.060110507 

0.000495529 

0.106053779 

2. 

USK 

Malis 

0.085927407 

0.000848007 

0.110390552 

3. 

SK 

Malis 

0.086975487 

0.000572968 

0.107365432 

4. 

SK 

Softmax 

0.087380844 

0.000585556 

0.075981492 

5. 

U 

Softmax + Malis 

0.097356122 

0.000940704 

0.101856259 

6. 

USK 

Softmax 

0.102450251 

0.000851440 

0.073163943 

7. 

U 

Malis 

0.121984927 

0.001038742 

0.111951817 

8. 

USK 

Softmax + Malis 

0.128440534 

0.000858688 

0.101919859 

9. 

U 

Softmax 

0.148045042 

0.001293564 

0.083922396 


Table 7.2: DS2 error evaluation (lower is better). 


Interestingly, the USK-Net with Softmax + Malis loss training performs unexpec¬ 
tedly worse on the dataset DS2 than on DSl, where it performed best. What is 
common for both DSl and DS2 is that the USK network combined with Softmax 
training performs best on the pixel error. Overall, this data set has a ranking 
much harder to explain than on DSl. Visual inspection reveals that the USK-Net 
was often overconfident when labeling the cell interior, which cormected cells that 
should be separated. On the DSl data set, this did not happen. 


Network 

Rand 

Warping 

Pixel 

U 

0.0382 

0.000353 

0.0611 


Table 7.3: (Source: Ronneberger et al. 0). 


Finally, the results of U-Net obtained by Ronneberger et al. 0 (Table |7.3| could 
not be reached, probably due to having used less transformations to extend the 
training data. It is not inherently clear if the U network would perform better than 
SK and USK given the bigger training data set. 

Ideas to improve training include: 

• Weight maps to scale the loss instead of only masking it. 

• Add elastic deformations, shifting and scaling instead of only rotation and 
mirroring to increase the amount of training data. 

• Experiment with different loss functions than Malis and Softmax, or altern¬ 
ate between them during training. 

7.3.3 Visual 

The visual analysis is based on the image number 16 of the DS2 test stack (see 
Section [Z2| . 
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7.3. Analysis on DS2 



Explanation to the labels in Figure |7.2[ 

The visual results of Softmax training (a to c) have thicker membrane labels than 
on DSl. This is because the error masking was enabled here and, as a result, 
the network has seen the same amount of error pixels for both foreground and 
background. 

(A) A bright spot, which is an error in the electron microscopy image. The 
membrane affected by it is misclassified by all networks and trainings. The 
local contrast enhancement with CLAHE did not help here. 

(B) Mitochondrion with close proximity parallel to a cell membrane. The USK 
network removes the membrane with all trainings (c, f and i). U-Net (b, 
e and h) performs a little better, but still merges the two cells. Only the 
SK network does it correctly (a, d and g), but has some uncertainty on the 
mitochondrion instead (d). 

(C) Diffuse section of a cell membrane. This is not an issue on most training/ar¬ 
chitecture combinations, except for U-Net with Softmax (b and h). 

(D) Oriented structures, even when faint, are partially labeled as membrane 
when sharp enough and of similar thickness as the membrane. This is no 
issue when isolated within the cell and not cutting through a cell that should 
be cormected. With convolutional networks, this is hard to impossible to la¬ 
bel correctly. It would require more high level knowledge of the object, such 
as if the predicted membrane is enclosing a cell or not. 
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7.3. Analysis on DS2 



(c) USK network, Softmax. 


(d) SK network, Malis. 



(e) U network, Malis 


(f) USK network, Malis. 


Figure 7.4: Visual comparison of all results in Table 


7.2 
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7.3. Analysis on DS2 



(a) SK network, Softmax + Malis. 


V' 

(b) U network, Softmax + Malis. 



(c) USK network, Softmax + Malis. 


Figure 7.4: Visual comparison of all results in Table 


7.2 


(E) Diffuse mitochondrion. The same situation as with B applies. Especially U- 
and USK-Net with Softmax training (b and c) get it wrong. 

(E) Diffuse cell interior that is similar to the membrane texture. All networks 
see a membrane connection through this area. In the training data there are 
some examples of diffuse membranes, so the networks have slightly overfit¬ 
ted on the training data for this case. 
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Chapter 8 


Conclusion 


8.1 Research Time Line 

An overview of the research time line, in order to give a context on what shaped 
the objectives and decisions made during the project: 


From 

To 

Activity / Event 

05.11.2014 

05.11.2014 

Collaboration request at UZH INI. 

09.11.2014 

22.01.2014 

Discussing ideas with Dr. Jan Funke. 

14.12.2014 

14.12.2014 

Hongsheng Li et al. paper released (SK kernels) |7|. 

07.02.2015 

07.02.2015 

Research proposal finished and accepted. 

23.02.2015 

23.02.2015 

Research beginning. 

26.02.2015 

06.03.2015 

Getting the sliding window network to work |10|. 

08.03.2015 

18.04.2015 

OpenCL backend development |5|. 

10.04.2015 

21.04.2015 

Discussing the project with AMD [251. 

22.04.2015 

22.04.2015 

Arrival of AMD's W9100 GPUs (hardware sponsoring). 

19.04.2015 

19.04.2015 

Pull request of the modified Gaffe to BVLG |4|. 

09.05.2015 

09.05.2015 

Public release of the Gaffe Neural Models [9). 

09.05.2015 

09.05.2015 

Public release of the Gaffe Neural Tool [16|. 

15.05.2015 

15.05.2015 

Ronneberger et al. paper released (U-Net) |2 . 

20.05.2015 

25.06.2015 

Testing of U-Net and design of USK-Net. 

27.06.2015 

12.07.2015 

Gollaboration at HHMI Janelia, Virginia, USA |28|. 

29.06.2015 

14.07.2015 

Implementing Malis loss and N-D SK kernels for Gaffe. 

12.07.2015 

15.07.2015 

Gritical source code development finished. 

13.07.2015 

20.08.2015 

Writing the report and final evaluation experiments. 

24.08.2015 


Post-research support of Project Greentea and ongoing 
development in collaboration with AMD, 

HHMI Janelia and the Gaffe community. 
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8.2. Implications 


8.2 Implications 

The first idea for the research project was to implement strided kernels. How¬ 
ever, with the release of the Hongsheng Li et al. 0 paper, the problem already 
got solved. We got their source code and I was able to translate existing sliding 
window networks to strided kernel networks. 

These events lead to a shift of focus to implement the OpenCL backend and sup¬ 
port a variety of hardware. This was important to see how existing CPU clusters 
and AMD GPUs could be used instead of only nVidia GPUs. 

A nice side effect of completely re-writing the whole Gaffe library to OpenGL 
was gaining a complete understanding of the library, the bottlenecks, how all 
layers work and what the most important objectives for optimizations and network 
design are. 

It turned out that running networks across devices does not give an advantage 
in the case of SK, U and USK architectures, as perfect scaling is possible when 
running independent instances of the network on each device. This only requires 
from the devices to have enough memory to hold the networks. This assumption 
was met when AMD's W9100 GPUs became available to me. 

SK networks did not scale as desired and up to lOO'OOO pixel classifications per 
second were only about 1/lOth of the desired speed. The original ideas to speed 
up the layers of the SK network by using methods such as multi device execution, 
Fourier transform convolutions or direct convolutions did not work. 

With the release of the Ronneberger et al. paper |[^, the research focus was shifted 
to analyzing the U-Net approach, which is able to classify up to one megapixel 
per second. 

Training of U-Net was more difficult than SK-Net and thus I tried to implement 
my own network architecture based on the findings of SK-Net and U-Net, which 
resulted in the experimental USK-Net. The USK-Net performs similarly to U-Net 
and produced better results with small training data sets (see Ghapterj^. 

Looking at ISBI 2012 results pT| and their test metrics, as well as the fact that one 
of the authors of the Mails criterion ||^, Dr. Srinivas Turaga, was at HHMI Janelia 
for collaboration, lead to the development of an additional loss layer for Gaffe (see 
Section [5.3.2| |. 

Finally, the last feature implemented before freezing the source code was N di¬ 
mensional strided kernel support for max pooling and convolution layers, as this 
was a feature requested by Dr. Stephan Saalfeld and Dr. Srinivas Turaga at HHMI 
Janelia. This can be used to run modified SK, U and USK network architectures 
on 3D blocks of volumetric-isotropic data sets, or even 3D over time (4D). 

8.3 Difficulties Encountered 

The obvious difficulty was to keep up with the general research in pixelwise clas¬ 
sification of images, as important papers 0, 0 were released during the project 
research. A shift of focus from the original plans were required a few times. This 
includes taking into account new results and dropping planned approaches. 
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8.4. Reproducibility of Results 


It was also a lot of work to keep up with the changes of the Caffe library Q, as they 
changed many core aspects such as network file format and shape specifications 
for memory blobs. This broke compability with existing code from Honghsheng 
Li et al.'s approach Q as well as the existing sliding window network p0| , |0. 
Constantly pulling new changes from the BVLC master branch |j^ and adapting 
my own branch to those changes was necessary. The benefit gained by doing this 
is that backwards compatibility to the official version is always guaranteed and 
that my own branch was ahead during the whole scope of the project. 

With programming the Caffe Neural Tool, the diversity of formats for labels and 
input data was complex to handle. Especially loading and storing TIF pictures 
that can have a variety of pixel formats and support stacking multiple images in a 
single file can be tricky. 

At last, it was not always obvious why a network does learn the expected features 
or not. Training parameter tests require up to ten hours of training on a GPU, 
which is very acceptable during production, but rather cumbersome during de¬ 
bugging. Evaluation and training of the networks for numeric results was only 
possible after freezing critical parts of the source code (computation kernels and 
layer implementations), because the results can differ after fixing bugs and other 
changes of the library. This resulted in having only two weeks left for this stage. 


8.4 Reproducibility of Results 


The results obtained in this report are guaranteed to be reproducible by the use of 
the following software pipeline, using CUDA or OpenCL hardware equivalent to 
the hardware used in this report. 

Repositories belonging to Project Greentea: 

• Caffe m 

URL: https://github.com/naibaf7/caffe 

Commit checksum: f84c2a4fb8d633bc7d8fc9771eb06a3cf 2215212 

• Caffe Neural Tool 

URL: https://github.com/naibaf7/caffe_neural_tool 
Commit checksum: 780e7dd72e4f 88c80729e2b33d6c0137d479016a 

• Caffe Neural Models 

URL: https://github.com/naibaf7/caffe_neural_models 
Commit checksum: dbb06d8352aa9c2ba99458cb9e9068500ebaccll 


As long as the ISBI 2012 challenge is ongoing, the data set DS2 results can be 
reproduced on their website |11|. The training and test data for the data set DSl 
remains included in the Caffe Neural Models repository. 
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8.5. Outlook 


8.5 Outlook 

In this outlook I give a brief introduction on future plans for the improved Caffe 
version ||^, extended use cases for the models missing features for the Caffe 
Neural Tool and ideas that did not fit into the time scope of the project. 


8.5.1 Device Abstracted Backend 


Currently, the OpenCL, CUDA and (legacy) CPU backend are implemented side- 
by-side and there is quite some code duplication in the Caffe library. To support 
future multi-device training methods and remove redundancy, the backend should 
be further unified so that the only remaining code duplication resides with the 
actual compute kernels used inside the layers. This will minimize bugs that occur 
on only one backend and make software verification much easier to handle. It will 
also shorten the time required for newly developed layers to become available on 
all devices. 

At the time of the project, the improved Caffe library 1^ drops full support on the 
legacy CPU backend in favor of an OpenCL hybrid solution (see Section 5.4.2| on 
CPUs. Some tuning to work correctly on NUMA processors (see Section \67) is 
still required. The CPU backend remains as a fallback for layers that do not work 
on OpenCL and CUDA, such as the Mails loss criterion (see Section 5.3.2|. 


8.5.2 Improving Training Data 

As a first step to improve results, all network architectures should be evaluated 
using more training data, acquired artificially or from more ground truth. 

It is advisable to try out all models on a given data set, as there is no clear winner 
among the networks. Results may vary strongly as the numerical analysis revealed 
(see Section [7!^ versus |7.3| l . 


8.5.3 Parameter Grid Search 

The network architectures presented here are only examples of a whole class of 
possible networks. Many parameters such as kernel sizes and how SK networks 
are combined with U type networks can be evaluated. Especially deep multi-path 
networks can be useful, merging the feature maps of different architectures before 
making the ouput label predictions. The new USK architecture is such an example. 


8.5.4 Testing of Volumetric Architectures 

Depending on the data set, SK, U and USK networks can be configured in many 
more ways for 3D than for 2D. For example, the depth direction is likely to have 
less physical resolution than the width and height dimension, due to how the data 
is acquired with slicing and electron microscopy. This should be considered when 
choosing the kernel size , kernel stride and span of the depth dimension. An 
example is the ISBI 2012 dataset which spans 2 x 2 x 1.5 microns with a resolution 
of 4 by 4 by 50 nm/pixel p^ , pT| , 1151. 
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8.6. Final Words 


8.5.5 Improving Test Metrics 

Visual inspection and the error metrics used in this report can only give informa¬ 
tion about how accurate the label predictions are compared to the ground truth. 
For connectomics, this may not be the most important objective. The tools which 
will further process the segmentations may be able to correct certain errors in the 
predictions by testing how likely the final result is, while other merge and split 
errors lead to uncorrectable errors. This is an objective that could be more useful 
when selecting the network architecture, loss function and training method. 

8.6 Final Words 

This research project combines many disciplines, such as high performance com¬ 
puting, machine learning, visual computing and a bit of cormectomics. I was able 
to implement most of the originally planned features and found replacements for 
ideas that turned out to work badly. Finally, the results of the research include 
a useful, versatile stack of Open Source software {Project Greentea) that can be ex¬ 
tended in the future. During working on the project, it was already possible to 
establish a growing user base p^ . The models and tools introduced with this pro¬ 
ject can be used efficiently on a large variety of data sets and hardware, making 
it very flexible. The collaboration at HHMI Janelia was also a great experience. 
Hardware sponsoring by AMD shows that programming of the Project Greentea 
and efficient machine learning libraries in general is of high interest also for hard¬ 
ware manufacturers. 
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Appendix A 


Network Architectures 
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A.l. SK-Net 


A.l SK-Net 



72 




































A.l. SK-Net 
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A.2. U-Net 
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A.2. U-Net 
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A.2. U-Net 
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A.2. U-Net 


kstride: 1 
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A.2. U-Net 
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A.2. U-Net 
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A.3. USK-Net 


A.3 USK-Net 
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A.3. USK-Net 
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A.3. USK-Net 
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A.3. USK-Net 
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A.3. USK-Net 
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